Combined and transfer learning of a variant pathogenicity predictor using gapped and non-gapped protein samples

ABSTRACT

The technology disclosed relates to training a pathogenicity predictor. In particular, the technology disclosed relates to accessing a gapped training set that includes respective gapped protein samples for respective positions in a proteome, accessing a non-gapped training set that includes non-gapped benign protein samples and non-gapped pathogenic protein samples, generating respective gapped spatial representations for the gapped protein samples, generating respective non-gapped spatial representations for the non-gapped benign protein samples and the non-gapped pathogenic protein samples, training a pathogenicity predictor over one or more training cycles and generating a trained pathogenicity predictor, wherein each of the training cycles uses as training examples gapped spatial representations from the respective gapped spatial representations and non-gapped spatial representations from the respective non-gapped spatial representations, and using the trained pathogenicity predictor to determine pathogenicity of variants.

PRIORITY APPLICATIONS

This application claims the benefit of U.S. Provisional Patent Application No. 63/253,122, titled “PROTEIN STRUCTURE-BASED PROTEIN LANGUAGE MODELS,” filed Oct. 6, 2021 (Attorney Docket No. ILLM 1050-1/IP-2164-PRV). The priority provisional application is hereby incorporated by reference for all purposes.

This application claims the benefit of U.S. Provisional Patent Application No. 63/281,579, titled “PREDICTING VARIANT PATHOGENICITY FROM EVOLUTIONARY CONSERVATION USING THREE-DIMENSIONAL (3D) PROTEIN STRUCTURE VOXELS,” filed Nov. 19, 2021 (Attorney Docket No. ILLM 1060-1/IP-2270-PRV). The priority provisional application is hereby incorporated by reference for all purposes.

This application claims the benefit of U.S. Provisional Patent Application No. 63/281,592, titled “COMBINED AND TRANSFER LEARNING OF A VARIANT PATHOGENICITY PREDICTOR USING GAPED AND NON-GAPED PROTEIN SAMPLES,” filed Nov. 19, 2021 (Attorney Docket No. ILLM 1061-1/IP-2271-PRV). The priority provisional application is hereby incorporated by reference for all purposes.

FIELD OF THE TECHNOLOGY DISCLOSED

The technology disclosed relates to artificial intelligence type computers and digital data processing systems and corresponding data processing methods and products for emulation of intelligence (i.e., knowledge based systems, reasoning systems, and knowledge acquisition systems); and including systems for reasoning with uncertainty (e.g., fuzzy logic systems), adaptive systems, machine learning systems, and artificial neural networks. In particular, the technology disclosed relates to using deep convolutional neural networks to analyze multi-channel voxelized data.

INCORPORATIONS

The following are incorporated by reference for all purposes as if fully set forth herein:

U.S. Nonprovisional patent application Ser. No. 17/953,286, titled “PREDICTING VARIANT PATHOGENICITY FROM EVOLUTIONARY CONSERVATION USING THREE-DIMENSIONAL (3D) PROTEIN STRUCTURE VOXELS,” filed on Sep. 26, 2022 (Attorney Docket No. ILLM 1060-2270-US);

U.S. Nonprovisional patent application Ser. No. 17/533,091, titled “PROTEIN STRUCTURE-BASED PROTEIN STRUCTURE VOXELS,” filed on Sep. 22, 2021 (Attorney Docket No. ILLM 1050-2/IP-2164-US);

Sundaram, L. et al. Predicting the clinical impact of human mutation with deep neural networks. Nat. Genet. 50, 1161-1170 (2018);

Jaganathan, K. et al. Predicting splicing from primary sequence with deep learning. Cell 176, 535-548 (2019);

U.S. Patent Application No. 62/573,144, titled “TRAINING A DEEP PATHOGENICITY CLASSIFIER USING LARGE-SCALE BENIGN TRAINING DATA,” filed Oct. 16, 2017 (Attorney Docket No. ILLM 1000-1/IP-1611-PRV);

U.S. Patent Application No. 62/573,149, titled “PATHOGENICITY CLASSIFIER BASED ON DEEP CONVOLUTIONAL NEURAL NETWORKS (CNNs),” filed Oct. 16, 2017 (Attorney Docket No. ILLM 1000-2/IP-1612-PRV);

U.S. Patent Application No. 62/573,153, titled “DEEP SEMI-SUPERVISED LEARNING THAT GENERATES LARGE-SCALE PATHOGENIC TRAINING DATA,” filed Oct. 16, 2017 (Attorney Docket No. ILLM 1000-3/IP-1613-PRV);

U.S. Patent Application No. 62/582,898, titled “PATHOGENICITY CLASSIFICATION OF GENOMIC DATA USING DEEP CONVOLUTIONAL NEURAL NETWORKS (CNNs),” filed Nov. 7, 2017 (Attorney Docket No. ILLM 1000-4/IP-1618-PRV);

U.S. patent application Ser. No. 16/160,903, titled “DEEP LEARNING-BASED TECHNIQUES FOR TRAINING DEEP CONVOLUTIONAL NEURAL NETWORKS,” filed on Oct. 15, 2018 (Attorney Docket No. ILLM 1000-5/IP-1611-US);

U.S. patent application Ser. No. 16/160,986, titled “DEEP CONVOLUTIONAL NEURAL NETWORKS FOR VARIANT CLASSIFICATION,” filed on Oct. 15, 2018 (Attorney Docket No. ILLM 1000-6/IP-1612-US);

U.S. patent application Ser. No. 16/160,968, titled “SEMI-SUPERVISED LEARNING FOR TRAINING AN ENSEMBLE OF DEEP CONVOLUTIONAL NEURAL NETWORKS,” filed on Oct. 15, 2018 (Attorney Docket No. ILLM 1000-7/IP-1613-US);

U.S. patent application Ser. No. 16/407,149, titled “DEEP LEARNING-BASED TECHNIQUES FOR PRE-TRAINING DEEP CONVOLUTIONAL NEURAL NETWORKS,” filed May 8, 2019 (Attorney Docket No. ILLM 1010-1/IP-1734-US);

U.S. patent application Ser. No. 17/232,056, titled “DEEP CONVOLUTIONAL NEURAL NETWORKS TO PREDICT VARIANT PATHOGENICITY USING THREE-DIMENSIONAL (3D) PROTEIN STRUCTURES,” filed on Apr. 15, 2021, (Atty. Docket No. ILLM 1037-2/IP-2051-US);

U.S. Patent Application No. 63/175,495, titled “MULTI-CHANNEL PROTEIN VOXELIZATION TO PREDICT VARIANT PATHOGENICITY USING DEEP CONVOLUTIONAL NEURAL NETWORKS,” filed on Apr. 15, 2021, (Atty. Docket No. ILLM 1047-1/IP-2142-PRV);

U.S. Patent Application No. 63/175,767, titled “EFFICIENT VOXELIZATION FOR DEEP LEARNING,” filed on Apr. 16, 2021, (Atty. Docket No. ILLM 1048-1/IP-2143-PRV); and

U.S. patent application Ser. No. 17/468,411, titled “ARTIFICIAL INTELLIGENCE-BASED ANALYSIS OF PROTEIN THREE-DIMENSIONAL (3D) STRUCTURES,” filed on Sep. 7, 2021, (Atty. Docket No. ILLM 1037-3/IP-2051A-US).

BACKGROUND

The subject matter discussed in this section should not be assumed to be prior art merely as a result of its mention in this section. Similarly, a problem mentioned in this section or associated with the subject matter provided as background should not be assumed to have been previously recognized in the prior art. The subject matter in this section merely represents different approaches, which in and of themselves can also correspond to implementations of the claimed technology.

Genomics, in the broad sense, also referred to as functional genomics, aims to characterize the function of every genomic element of an organism by using genome-scale assays such as genome sequencing, transcriptome profiling and proteomics. Genomics arose as a data-driven science: it operates by discovering novel properties from explorations of genome-scale data rather than by testing preconceived models and hypotheses. Applications of genomics include finding associations between genotype and phenotype, discovering biomarkers for patient stratification, predicting the function of genes, and charting biochemically active genomic regions such as transcriptional enhancers.

Genomics data are too large and too complex to be mined solely by visual investigation of pairwise correlations. Instead, analytical tools are required to support the discovery of unanticipated relationships, to derive novel hypotheses and models and to make predictions. Unlike some algorithms, in which assumptions and domain expertise are hard coded, machine learning algorithms are designed to automatically detect patterns in data. Hence, machine learning algorithms are suited to data-driven sciences and, in particular, to genomics. However, the performance of machine learning algorithms can strongly depend on how the data are represented, that is, on how each variable (also called a feature) is computed. For instance, to classify a tumor as malignant or benign from a fluorescent microscopy image, a preprocessing algorithm could detect cells, identify the cell type, and generate a list of cell counts for each cell type.

A machine learning model can take the estimated cell counts, which are examples of handcrafted features, as input features to classify the tumor. A central issue is that classification performance depends heavily on the quality and the relevance of these features. For example, relevant visual features such as cell morphology, distances between cells or localization within an organ are not captured in cell counts, and this incomplete representation of the data may reduce classification accuracy.

Deep learning, a subdiscipline of machine learning, addresses this issue by embedding the computation of features into the machine learning model itself to yield end-to-end models. This outcome has been realized through the development of deep neural networks, machine learning models that comprise successive elementary operations, which compute increasingly complex features by taking the results of preceding operations as input. Deep neural networks are able to improve prediction accuracy by discovering relevant features of high complexity, such as the cell morphology and spatial organization of cells in the above example. The construction and training of deep neural networks have been enabled by the explosion of data, algorithmic advances, and substantial increases in computational capacity, particularly through the use of graphics processing units (GPUs).

The goal of supervised learning is to obtain a model that takes features as input and returns a prediction for a so-called target variable. An example of a supervised learning problem is one that predicts whether an intron is spliced out or not (the target) given features on the RNA such as the presence or absence of the canonical splice site sequence, the location of the splicing branchpoint or intron length. Training a machine learning model refers to learning its parameters, which commonly involves minimizing a loss function on training data with the aim of making accurate predictions on unseen data.

For many supervised learning problems in computational biology, the input data can be represented as a table with multiple columns, or features, each of which contains numerical or categorical data that are potentially useful for making predictions. Some input data are naturally represented as features in a table (such as temperature or time), whereas other input data need to be first transformed (such as deoxyribonucleic acid (DNA) sequence into k-mer counts) using a process called feature extraction to fit a tabular representation. For the intron-splicing prediction problem, the presence or absence of the canonical splice site sequence, the location of the splicing branchpoint and the intron length can be preprocessed features collected in a tabular format. Tabular data are standard for a wide range of supervised machine learning models, ranging from simple linear models, such as logistic regression, to more flexible nonlinear models, such as neural networks and many others.
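By way of a non-limiting illustration, the feature extraction step that turns a DNA sequence into k-mer counts can be sketched as follows. The function name `kmer_counts` and the example sequence are hypothetical and chosen only for illustration; the sketch assumes an alphabet of A, C, G, and T and counts overlapping k-mers.

```python
from collections import Counter
from itertools import product


def kmer_counts(sequence, k):
    """Return a fixed-length table of overlapping k-mer counts for a DNA sequence.

    Every possible k-mer over A/C/G/T gets a column, so sequences of
    different lengths map to feature vectors of the same size (4**k).
    """
    sequence = sequence.upper()
    observed = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    all_kmers = ("".join(letters) for letters in product("ACGT", repeat=k))
    return {kmer: observed.get(kmer, 0) for kmer in all_kmers}


# Hypothetical example: 2-mer counts for a short sequence.
features = kmer_counts("ACGTACGT", 2)
```

The resulting dictionary has one entry per possible k-mer, which is what allows the counts to be placed in a tabular format alongside other features.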

Logistic regression is a binary classifier, that is, a supervised learning model that predicts a binary target variable. Specifically, logistic regression predicts the probability of the positive class by computing a weighted sum of the input features mapped to the [0, 1] interval using the sigmoid function, a type of activation function. The parameters of logistic regression, or other linear classifiers that use different activation functions, are the weights in the weighted sum. Linear classifiers fail when the classes, for instance, that of an intron spliced out or not, cannot be well discriminated with a weighted sum of input features. To improve predictive performance, new input features can be manually added by transforming or combining existing features in new ways, for example, by taking powers or pairwise products.
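As one hedged example, the prediction step of logistic regression described above, a weighted sum passed through the sigmoid function, can be sketched in a few lines. The names `sigmoid` and `predict_proba`, the bias term, and the weight values are assumptions made for illustration and are not taken from any particular implementation.

```python
import math


def sigmoid(z):
    """Map a real number to the (0, 1) interval, avoiding overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(features, weights, bias):
    """Probability of the positive class: sigmoid of the weighted sum of features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical feature vector (e.g., splice-site present, branchpoint score)
# and hypothetical weights; in practice the weights are learned from data.
p = predict_proba([1.0, 0.5], weights=[2.0, -1.0], bias=-0.5)
```

Training would adjust `weights` and `bias` to minimize a loss function on labeled examples; only the forward computation is shown here.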

Neural networks use hidden layers to learn these nonlinear feature transformations automatically. Each hidden layer can be thought of as multiple linear models with their output transformed by a nonlinear activation function, such as the sigmoid function or the more popular rectified-linear unit (ReLU). Together, these layers compose the input features into relevant complex patterns, which facilitates the task of distinguishing two classes.
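To make the role of a hidden layer concrete, the following minimal sketch uses the exclusive-or (XOR) function, which no single weighted sum of the two inputs can reproduce, and shows that one hidden layer of two ReLU units suffices. The weights are hand-picked for illustration rather than learned, and the names `dense` and `xor_net` are hypothetical.

```python
def relu(z):
    """Rectified-linear unit: pass positive values, clamp negatives to zero."""
    return max(0.0, z)


def dense(inputs, weights, biases, activation):
    """One layer: several linear models, each followed by the same activation."""
    return [
        activation(b + sum(w * x for w, x in zip(row, inputs)))
        for row, b in zip(weights, biases)
    ]


# Hand-picked (not learned) parameters, chosen only to illustrate the idea.
HIDDEN_W = [[1.0, 1.0], [1.0, 1.0]]
HIDDEN_B = [0.0, -1.0]
OUT_W = [[1.0, -2.0]]
OUT_B = [0.0]


def xor_net(x1, x2):
    """Two ReLU hidden units followed by a linear output unit."""
    hidden = dense([x1, x2], HIDDEN_W, HIDDEN_B, relu)
    return dense(hidden, OUT_W, OUT_B, lambda z: z)[0]
```

The second hidden unit activates only when both inputs are on, and the output layer subtracts it twice, which is the kind of pairwise interaction that would otherwise have to be added as a manual feature.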

Deep neural networks use many hidden layers, and a layer is said to be fully-connected when each neuron receives inputs from all neurons of the preceding layer. Neural networks are commonly trained using stochastic gradient descent, an algorithm suited to training models on very large data sets. Implementation of neural networks using modern deep learning frameworks enables rapid prototyping with different architectures and data sets. Fully-connected neural networks can be used for a number of genomics applications, which include predicting the percentage of exons spliced in for a given sequence from sequence features such as the presence of binding motifs of splice factors or sequence conservation; prioritizing potential disease-causing genetic variants; and predicting cis-regulatory elements in a given genomic region using features such as chromatin marks, gene expression and evolutionary conservation.

Local dependencies in spatial and longitudinal data must be considered for effective predictions. For example, shuffling a DNA sequence or the pixels of an image severely disrupts informative patterns. These local dependencies set spatial or longitudinal data apart from tabular data, for which the ordering of the features is arbitrary. Consider the problem of classifying genomic regions as bound versus unbound by a particular transcription factor, in which bound regions are defined as high-confidence binding events in chromatin immunoprecipitation followed by sequencing (ChIP-seq) data. Transcription factors bind to DNA by recognizing sequence motifs. A fully-connected layer based on sequence-derived features, such as the number of k-mer instances or the position weight matrix (PWM) matches in the sequence, can be used for this task. As k-mer or PWM instance frequencies are robust to shifting motifs within the sequence, such models could generalize well to sequences with the same motifs located at different positions. However, they would fail to recognize patterns in which transcription factor binding depends on a combination of multiple motifs with well-defined spacing. Furthermore, the number of possible k-mers increases exponentially with k-mer length, which poses both storage and overfitting challenges.

A convolutional layer is a special form of fully-connected layer in which the same fully-connected layer is applied locally, for example, in a 6 bp window, to all sequence positions. This approach can also be viewed as scanning the sequence using multiple PWMs, for example, for transcription factors GATA1 and TAL1. By using the same model parameters across positions, the total number of parameters is drastically reduced, and the network is able to detect a motif at positions not seen during training. Each convolutional layer scans the sequence with several filters by producing a scalar value at every position, which quantifies the match between the filter and the sequence. As in fully-connected neural networks, a nonlinear activation function (commonly ReLU) is applied at each layer. Next, a pooling operation is applied, which aggregates the activations in contiguous bins across the positional axis, commonly taking the maximal or average activation for each channel. Pooling reduces the effective sequence length and coarsens the signal. The subsequent convolutional layer composes the output of the previous layer and is able to detect whether a GATA1 motif and TAL1 motif were present at some distance range. Finally, the output of the convolutional layers can be used as input to a fully-connected neural network to perform the final prediction task. Hence, different types of neural network layers (e.g., fully-connected layers and convolutional layers) can be combined within a single neural network.
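The scan, activation, and pooling steps described above can be illustrated with a single hand-built filter. This is a minimal sketch under stated assumptions: the filter is a toy exact-match detector for the four-letter string GATA (not a real PWM and not learned), the bias of -3 is chosen so that only a perfect match survives the ReLU, and all function names are hypothetical.

```python
BASES = "ACGT"


def one_hot(sequence):
    """Encode each base as a 4-element indicator vector (A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in sequence]


def motif_filter(motif):
    """Toy filter: weight 1 for the expected base at each position, else 0."""
    return one_hot(motif)


def conv1d(encoded, filt, bias):
    """Apply the same filter at every position and return one score per window."""
    width = len(filt)
    return [
        bias + sum(filt[j][c] * encoded[start + j][c]
                   for j in range(width) for c in range(4))
        for start in range(len(encoded) - width + 1)
    ]


def relu(values):
    return [max(0.0, v) for v in values]


def max_pool(values, size):
    """Keep the maximal activation in each contiguous bin of `size` positions."""
    return [max(values[i:i + size]) for i in range(0, len(values), size)]


def scan(sequence, motif="GATA", pool_size=2):
    scores = conv1d(one_hot(sequence), motif_filter(motif), bias=-3.0)
    return max_pool(relu(scores), pool_size)


pooled_middle = scan("CCGATACC")  # motif in the middle of the sequence
pooled_start = scan("GATACCCC")   # same motif shifted to the start
```

Because the same filter weights are reused at every position, the motif is detected in both sequences even though it sits at different offsets, and pooling shortens the five window scores to three bins.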

Convolutional neural networks (CNNs) can predict various molecular phenotypes on the basis of DNA sequence alone. Applications include classifying transcription factor binding sites and predicting molecular phenotypes such as chromatin features, DNA contact maps, DNA methylation, gene expression, translation efficiency, RBP binding, and microRNA (miRNA) targets. In addition to predicting molecular phenotypes from the sequence, convolutional neural networks can be applied to more technical tasks traditionally addressed by handcrafted bioinformatics pipelines. For example, convolutional neural networks can predict the specificity of guide RNA, denoise ChIP-seq, enhance Hi-C data resolution, predict the laboratory of origin from DNA sequences and call genetic variants. Convolutional neural networks have also been employed to model long-range dependencies in the genome. Although interacting regulatory elements may be distantly located on the unfolded linear DNA sequence, these elements are often proximal in the actual 3D chromatin conformation. Hence, modeling molecular phenotypes from the linear DNA sequence, albeit a crude approximation of the chromatin, can be improved by allowing for long-range dependencies and allowing the model to implicitly learn aspects of the 3D organization, such as promoter-enhancer looping. This is achieved by using dilated convolutions, which have a receptive field of up to 32 kb. Dilated convolutions also allow splice sites to be predicted from sequence using a receptive field of 10 kb, thereby enabling the integration of genetic sequence across distances as long as typical human introns (See Jaganathan, K. et al. Predicting splicing from primary sequence with deep learning. Cell 176, 535-548 (2019)).
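The reason dilation enlarges the receptive field so quickly can be shown with a short calculation. For a stack of stride-1 convolutions, each layer with kernel size k and dilation d adds (k - 1) * d positions to the receptive field. The sketch below applies that rule; the layer configurations are hypothetical and are not the architectures of the cited models.

```python
def receptive_field(layers):
    """Receptive field, in input positions, of stacked stride-1 convolutions.

    `layers` is a sequence of (kernel_size, dilation) pairs; each layer
    widens the field by (kernel_size - 1) * dilation positions.
    """
    field = 1
    for kernel_size, dilation in layers:
        field += (kernel_size - 1) * dilation
    return field


# Hypothetical four-layer stacks with kernel size 3.
plain = receptive_field([(3, 1), (3, 1), (3, 1), (3, 1)])
dilated = receptive_field([(3, 1), (3, 2), (3, 4), (3, 8)])
```

With the same number of layers and parameters, doubling the dilation at each layer widens the field from 9 to 31 positions, so the field grows exponentially with depth rather than linearly.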

Different types of neural networks can be characterized by their parameter-sharing schemes. For example, fully-connected layers have no parameter sharing, whereas convolutional layers impose translational invariance by applying the same filters at every position of their input. Recurrent neural networks (RNNs) are an alternative to convolutional neural networks for processing sequential data, such as DNA sequences or time series, that implement a different parameter-sharing scheme. Recurrent neural networks apply the same operation to each sequence element. The operation takes as input the memory of the previous sequence element and the new input. It updates the memory and optionally emits an output, which is either passed on to subsequent layers or is directly used as model predictions. By applying the same model at each sequence element, recurrent neural networks are invariant to the position index in the processed sequence. For example, a recurrent neural network can detect an open reading frame in a DNA sequence regardless of the position in the sequence. This task requires the recognition of a certain series of inputs, such as the start codon followed by an in-frame stop codon.
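The recurrence itself, one operation applied to every sequence element with a memory carried forward, can be sketched with a hand-written update rule. This is an illustrative stand-in and not a trained recurrent neural network: a real network would learn its update from data, whereas the `step` function below is coded by hand, and the memory layout and function names are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def step(memory, base):
    """One recurrent update: combine the previous memory with the new input.

    memory = (last two bases, inside an open reading frame?,
              codon phase since the start codon, ORF found so far?)
    """
    last_two, in_orf, phase, found = memory
    codon = last_two + base
    if len(codon) == 3:
        if not in_orf and codon == "ATG":
            in_orf, phase = True, 0          # start codon opens the frame
        elif in_orf:
            phase = (phase + 1) % 3
            if phase == 0 and codon in STOP_CODONS:
                in_orf, found = False, True  # in-frame stop closes the frame
    return (codon[-2:], in_orf, phase, found)


def has_orf(sequence):
    """Apply the same step to every base and read the answer from the memory."""
    memory = ("", False, 0, False)
    for base in sequence.upper():
        memory = step(memory, base)
    return memory[3]
```

Because `step` never looks at the position index, `has_orf("ATGAAATAG")` and `has_orf("CCATGAAATAGCC")` give the same answer, while a stop codon that is out of frame, as in `"CCATGAAATTAGCC"`, is ignored.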

The main advantage of recurrent neural networks over convolutional neural networks is that they are, in theory, able to carry over information through infinitely long sequences via memory. Furthermore, recurrent neural networks can naturally process sequences of widely varying length, such as mRNA sequences. However, convolutional neural networks combined with various tricks (such as dilated convolutions) can reach comparable or even better performances than recurrent neural networks on sequence-modeling tasks, such as audio synthesis and machine translation. Recurrent neural networks can aggregate the outputs of convolutional neural networks for predicting single-cell DNA methylation states, RBP binding, transcription factor binding, and DNA accessibility. Moreover, because recurrent neural networks apply a sequential operation, they cannot be easily parallelized and are hence much slower to compute than convolutional neural networks.

Each human has a unique genetic code, though a large portion of the human genetic code is common for all humans. In some cases, a human genetic code may include an outlier, called a genetic variant, that may be common among individuals of a relatively small group of the human population. For example, a particular human protein may comprise a specific sequence of amino acids, whereas a variant of that protein may differ by one amino acid in the otherwise same specific sequence.

Genetic variants may be pathogenic, leading to diseases. Though most of such genetic variants have been depleted from genomes by natural selection, an ability to identify which genetic variants are likely to be pathogenic can help researchers focus on these genetic variants to gain an understanding of the corresponding diseases and their diagnostics, treatments, or cures. The clinical interpretation of millions of human genetic variants remains unclear. Some of the most frequent pathogenic variants are single nucleotide missense mutations that change the amino acid of a protein. However, not all missense mutations are pathogenic.

Models that can predict molecular phenotypes directly from biological sequences can be used as in silico perturbation tools to probe the associations between genetic variation and phenotypic variation and have emerged as new methods for quantitative trait loci identification and variant prioritization. These approaches are of major importance given that the majority of variants identified by genome-wide association studies of complex phenotypes are non-coding, which makes it challenging to estimate their effects and contribution to phenotypes. Moreover, linkage disequilibrium results in blocks of variants being co-inherited, which creates difficulties in pinpointing individual causal variants. Thus, sequence-based deep learning models that can be used as interrogation tools for assessing the impact of such variants offer a promising approach to find potential drivers of complex phenotypes. One example includes predicting the effect of non-coding single-nucleotide variants and short insertions or deletions (indels) indirectly from the difference between two variants in terms of transcription factor binding, chromatin accessibility or gene expression predictions. Another example includes predicting novel splice site creation from sequence or quantitative effects of genetic variants on splicing.

End-to-end deep learning approaches for variant effect predictions are applied to predict the pathogenicity of missense variants from protein sequence and sequence conservation data (See Sundaram, L. et al. Predicting the clinical impact of human mutation with deep neural networks. Nat. Genet. 50, 1161-1170 (2018), referred to herein as “PrimateAI”). PrimateAI uses deep neural networks trained on variants of known pathogenicity with data augmentation using cross-species information. In particular, PrimateAI uses sequences of wild-type and mutant proteins to compare the difference and decide the pathogenicity of mutations using the trained deep neural networks. Such an approach, which utilizes the protein sequences for pathogenicity prediction, is promising because it can avoid the circularity problem and overfitting to previous knowledge. However, relative to the amount of data needed to train the deep neural networks effectively, the amount of clinical data available in ClinVar is small. To overcome this data scarcity, PrimateAI uses common human variants and variants from primates as benign data, while simulated variants based on trinucleotide context are used as unlabeled data.

PrimateAI outperforms prior methods when trained directly upon sequence alignments. PrimateAI learns important protein domains, conserved amino acid positions, and sequence dependencies directly from the training data consisting of about 120,000 human samples. PrimateAI substantially exceeds the performance of other variant pathogenicity prediction tools in differentiating benign and pathogenic de-novo mutations in candidate developmental disorder genes, and in reproducing prior knowledge in ClinVar. These results suggest that PrimateAI is an important step forward for variant classification tools that may lessen the reliance of clinical reporting on prior knowledge.

Central to protein biology is the understanding of how structural elements give rise to observed function. The surfeit of protein structural data enables development of computational methods to systematically derive rules governing structural-functional relationships. However, performance of these methods depends critically on the choice of protein structural representation.

Protein sites are microenvironments within a protein structure, distinguished by their structural or functional role. A site can be defined by a three-dimensional (3D) location and a local neighborhood around this location in which the structure or function exists. Central to rational protein engineering is the understanding of how the structural arrangement of amino acids creates functional characteristics within protein sites. Determination of the structural and functional roles of individual amino acids within a protein provides information to help engineer and alter protein functions. Identifying functionally or structurally important amino acids allows focused engineering efforts such as site-directed mutagenesis for altering targeted protein functional properties. Alternatively, this knowledge can help avoid engineering designs that would abolish a desired function.

Since it has been established that structure is far more conserved than sequence, the increase in protein structural data provides an opportunity to systematically study the underlying pattern governing the structural-functional relationships using data-driven approaches. A fundamental aspect of any computational protein analysis is how protein structural information is represented. The performance of machine learning methods often depends more on the choice of data representation than the machine learning algorithm employed. Good representations efficiently capture the most critical information while poor representations create a noisy distribution with no underlying patterns.

The surfeit of protein structures and the recent success of deep learning algorithms provide an opportunity to develop tools for automatically extracting task-specific representations of protein structures. Therefore, an opportunity arises to predict variant pathogenicity using multi-channel voxelized representations of 3D protein structures as input to deep neural networks.

BRIEF DESCRIPTION OF THE DRAWINGS

The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.

The color drawings also may be available in PAIR via the Supplemental Content tab. In the drawings, like reference characters generally refer to like parts throughout the different views. Also, the drawings are not necessarily to scale, with an emphasis instead generally being placed upon illustrating the principles of the technology disclosed. In the following description, various implementations of the technology disclosed are described with reference to the following drawings, in which:

FIG. 1 is a flow diagram that illustrates a process of a system for determining pathogenicity of variants, according to various implementations of the technology disclosed.

FIG. 2 schematically illustrates an example reference amino acid sequence of a protein and an alternative amino acid sequence of the protein, in accordance with one implementation of the technology disclosed.

FIG. 3 illustrates amino acid-wise classification of atoms of amino acids in the reference amino acid sequence of FIG. 2, in accordance with one implementation of the technology disclosed.

FIG. 4 illustrates amino acid-wise attribution of 3D atomic coordinates of the alpha-carbon atoms classified in FIG. 3 on an amino acid-basis, in accordance with one implementation of the technology disclosed.

FIG. 5 schematically illustrates a process of determining voxel-wise distance values, in accordance with one implementation of the technology disclosed.

FIG. 6 shows an example of twenty-one amino acid-wise distance channels, in accordance with one implementation of the technology disclosed.

FIG. 7 is a schematic diagram of a distance channel tensor, in accordance with one implementation of the technology disclosed.

FIG. 8 shows one-hot encodings of the reference amino acid and the alternative amino acid from FIG. 2, in accordance with one implementation of the technology disclosed.

FIG. 9 is a schematic diagram of a voxelized one-hot encoded reference amino acid and a voxelized one-hot encoded variant/alternative amino acid, in accordance with one implementation of the technology disclosed.

FIG. 10 schematically illustrates a concatenation process that voxel-wise concatenates the distance channel tensor of FIG. 7 and a reference allele tensor, in accordance with one implementation of the technology disclosed.

FIG. 11 schematically illustrates a concatenation process that voxel-wise concatenates the distance channel tensor of FIG. 7, the reference allele tensor of FIG. 10, and an alternative allele tensor, in accordance with one implementation of the technology disclosed.

FIG. 12 is a flow diagram that illustrates a process of a system for determining and assigning pan-amino acid conservation frequencies of nearest atoms to voxels (voxelizing), in accordance with one implementation of the technology disclosed.

FIG. 13 illustrates voxels-to-nearest amino acids, in accordance with one implementation of the technology disclosed.

FIG. 14 shows an example multi-sequence alignment of the reference amino acid sequence across ninety-nine species, in accordance with one implementation of the technology disclosed.

FIG. 15 shows an example of determining a pan-amino acid conservation frequencies sequence for a particular voxel, in accordance with one implementation of the technology disclosed.

FIG. 16 shows respective pan-amino acid conservation frequencies determined for respective voxels using the position frequency logic described in FIG. 15, in accordance with one implementation of the technology disclosed.

FIG. 17 illustrates voxelized per-voxel evolutionary profiles, in accordance with one implementation of the technology disclosed.

FIG. 18 depicts an example of an evolutionary profiles tensor, in accordance with one implementation of the technology disclosed.

FIG. 19 is a flow diagram that illustrates a process of a system for determining and assigning per-amino acid conservation frequencies of nearest atoms to voxels (voxelizing), in accordance with one implementation of the technology disclosed.

FIG. 20 shows various examples of voxelized annotation channels that are concatenated with the distance channel tensor, in accordance with one implementation of the technology disclosed.

FIG. 21 illustrates different combinations and permutations of input channels that can be provided as inputs to a pathogenicity classifier for pathogenicity determination of a target variant, in accordance with one implementation of the technology disclosed.

FIG. 22 shows different methods of calculating the disclosed distance channels, in accordance with various implementations of the technology disclosed.

FIG. 23 shows different examples of the evolutionary channels, in accordance with various implementations of the technology disclosed.

FIG. 24 shows different examples of the annotations channels, in accordance with various implementations of the technology disclosed.

FIG. 25 shows different examples of the structure confidence channels, in accordance with various implementations of the technology disclosed.

FIG. 26 shows an example processing architecture of the pathogenicity classifier, in accordance with one implementation of the technology disclosed.

FIG. 27 shows an example processing architecture of the pathogenicity classifier, in accordance with one implementation of the technology disclosed.

FIGS. 28, 29, 30, and 31 use PrimateAI as a benchmark model to demonstrate the disclosed PrimateAI 3D's classification superiority over PrimateAI.

FIGS. 32A and 32B show the disclosed efficient voxelization process, in accordance with various implementations of the technology disclosed.

FIG. 33 depicts how atoms are associated with voxels that contain the atoms, in accordance with one implementation of the technology disclosed.

FIG. 34 shows generating voxel-to-atoms mapping from atom-to-voxels mapping to identify nearest atoms on a voxel-by-voxel basis, in accordance with one implementation of the technology disclosed.

FIGS. 35A and 35B illustrate how the disclosed efficient voxelization has a runtime complexity of O(#atoms) versus the runtime complexity of O(#atoms*#voxels) without the use of the disclosed efficient voxelization.

FIG. 36 shows an example computer system that can be used to implement the technology disclosed.

FIG. 37 illustrates one implementation of determining variant pathogenicity for a target alternate amino acid based on processing a gapped protein spatial representation.

FIG. 38 shows an example of a spatial representation of a protein.

FIG. 39 shows an example of a gapped spatial representation of the protein illustrated in FIG. 38.

FIG. 40 shows an example of an atomic spatial representation of the protein illustrated in FIG. 38.

FIG. 41 shows an example of a gapped atomic spatial representation of the protein illustrated in FIG. 38.

FIG. 42 illustrates one implementation of a pathogenicity classifier determining variant pathogenicity for a target alternate amino acid based on processing a gapped protein spatial representation and an alternate amino acid representation of the target alternate amino acid.

FIG. 43 depicts one implementation of training data used to train the pathogenicity classifier.

FIG. 44 illustrates one implementation of generating gapped spatial representations for reference protein samples by using reference amino acids as gap amino acids.

FIG. 45 shows one implementation of training the pathogenicity classifier on benign protein samples.

FIG. 46 shows one implementation of training the pathogenicity classifier on pathogenic protein samples.

FIG. 47 shows how certain unreachable amino acid classes are masked during training.

FIG. 48 illustrates one implementation of determining a final pathogenicity score.

FIG. 49A shows that a variant pathogenicity determination is made for a target alternate amino acid filling a vacancy created by a reference gap amino acid at a given position in a protein.

FIG. 49B shows that respective variant pathogenicity determinations are made for amino acids of respective amino acid classes filling the vacancy created by the reference gap amino acid at the given position in the protein.

FIG. 50 illustrates one implementation of determining variant pathogenicity for multiple alternate amino acids based on processing a gapped protein spatial representation.

FIG. 51 illustrates one implementation of the pathogenicity classifier determining variant pathogenicity for multiple alternate amino acids based on processing a gapped protein spatial representation.

FIG. 52 illustrates one implementation of concurrently training the pathogenicity classifier on benign and pathogenic protein samples.

FIG. 53 illustrates one implementation of determining variant pathogenicity for multiple alternate amino acids based on processing a gapped protein spatial representation and, in response, generating evolutionary conservation scores for the multiple alternate amino acids.

FIG. 54 shows the evolutionary conservation determiner in operation, in accordance with one implementation.

FIG. 55 illustrates one implementation of determining pathogenicity based on predicted evolutionary scores.

FIG. 56 illustrates one implementation of training data used to train the evolutionary conservation determiner.

FIG. 57 illustrates one implementation of concurrently training the evolutionary conservation determiner on benign and pathogenic protein samples.

FIG. 58 depicts different implementations of ground truth label encodings used to train the evolutionary conservation determiner.

FIG. 59 illustrates an example position-specific frequency matrix (PSFM).

FIG. 60 depicts an example position-specific scoring matrix (PSSM).

FIG. 61 shows one implementation of generating the PSFM and the PSSM.

FIG. 62 illustrates an example PSFM encoding.

FIG. 63 depicts an example PSSM encoding.

FIG. 64 illustrates two datasets on which the models disclosed herein can be trained.

FIGS. 65A-65B illustrate one implementation of combined learning of the models disclosed herein.

FIGS. 66A-66B illustrate one implementation of using transfer learning to train the models disclosed herein using the two datasets shown in FIG. 64.

FIG. 67 shows one implementation of generating training data and labels to train the models disclosed herein.

FIG. 68 illustrates one implementation of a method of determining pathogenicity of nucleotide variants.

FIG. 69 illustrates one implementation of a system to predict structural tolerability of amino acid substitutes.

FIG. 70 depicts performance results that demonstrate objective indicia of non-obviousness and inventiveness.

DETAILED DESCRIPTION

The following discussion is presented to enable any person skilled in the art to make and use the technology disclosed and is provided in the context of a particular application and its requirements. Various modifications to the disclosed implementations will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other implementations and applications without departing from the spirit and scope of the technology disclosed. Thus, the technology disclosed is not intended to be limited to the implementations shown but is to be accorded the widest scope consistent with the principles and features disclosed herein.

The detailed description of various implementations will be better understood when read in conjunction with the appended drawings. To the extent that the figures illustrate diagrams of the functional blocks of the various implementations, the functional blocks are not necessarily indicative of the division between hardware circuitry. Thus, for example, one or more of the functional blocks (e.g., modules, processors, or memories) may be implemented in a single piece of hardware (e.g., a general purpose signal processor or a block of random access memory, hard disk, or the like) or multiple pieces of hardware. Similarly, the programs may be stand-alone programs, may be incorporated as subroutines in an operating system, may be functions in an installed software package, and the like. It should be understood that the various implementations are not limited to the arrangements and instrumentality shown in the drawings.

The processing engines and databases of the figures, designated as modules, can be implemented in hardware or software, and need not be divided up in precisely the same blocks as shown in the figures. Some of the modules can also be implemented on different processors, computers, or servers, or spread among a number of different processors, computers, or servers. In addition, it will be appreciated that some of the modules can be combined, operated in parallel or in a different sequence than that shown in the figures without affecting the functions achieved. The modules in the figures can also be thought of as flowchart steps in a method. A module also need not necessarily have all its code disposed contiguously in memory; some parts of the code can be separated from other parts of the code with code from other modules or other functions disposed in between.

Protein Structure-Based Pathogenicity Determination

FIG. 1 is a flow diagram that illustrates a process 100 of a system for determining pathogenicity of variants. At step 102, a sequence accessor 104 of the system accesses reference and alternative amino acid sequences. At step 112, a 3D structure generator 114 of the system generates 3D protein structures for a reference amino acid sequence. In some implementations, the 3D protein structures are homology models of human proteins. In one implementation, a so-called SwissModel homology modelling pipeline provides a public repository of predicted human protein structures. In another implementation, a so-called HHpred homology modelling uses a tool called Modeller to predict the structure of a target protein from template structures.

Proteins are represented by a collection of atoms and their coordinates in 3D space. An amino acid can have a variety of atoms, such as carbon atoms, oxygen (O) atoms, nitrogen (N) atoms, and hydrogen (H) atoms. The atoms can be further classified as side chain atoms and backbone atoms. The backbone carbon atoms can include alpha-carbon (C_(α)) atoms and beta-carbon (C_(β)) atoms.

At step 122, a coordinate classifier 124 of the system classifies 3D atomic coordinates of the 3D protein structures on an amino acid-basis. In one implementation, the amino acid-wise classification involves attributing the 3D atomic coordinates to the twenty-one amino acid categories (including stop or gap amino acid category). In one example, an amino acid-wise classification of alpha-carbon atoms can respectively list alpha-carbon atoms under each of the twenty-one amino acid categories. In another example, an amino acid-wise classification of beta-carbon atoms can respectively list beta-carbon atoms under each of the twenty-one amino acid categories.

In yet another example, an amino acid-wise classification of oxygen atoms can respectively list oxygen atoms under each of the twenty-one amino acid categories. In yet another example, an amino acid-wise classification of nitrogen atoms can respectively list nitrogen atoms under each of the twenty-one amino acid categories. In yet another example, an amino acid-wise classification of hydrogen atoms can respectively list hydrogen atoms under each of the twenty-one amino acid categories.

A person skilled in the art will appreciate that, in various implementations, the amino acid-wise classification can include a subset of the twenty-one amino acid categories and a subset of the different atomic elements.

At step 132, a voxel grid generator 134 of the system instantiates a voxel grid. The voxel grid can have any resolution, for example, 3×3×3, 5×5×5, 7×7×7, and so on. Voxels in the voxel grid can be of any size, for example, one angstrom (Å) on each side, two Å on each side, three Å on each side, and so on. One skilled in the art will appreciate that these example dimensions refer to cubic dimensions because voxels are cubes. Also, one skilled in the art will appreciate that these example dimensions are non-limiting, and the voxels can have any cubic dimensions.

At step 142, a voxel grid centerer 144 of the system centers the voxel grid at the reference amino acid experiencing a target variant at the amino acid level. In one implementation, the voxel grid is centered at an atomic coordinate of a particular atom of the reference amino acid experiencing the target variant, for example, the 3D atomic coordinate of the alpha-carbon atom of the reference amino acid experiencing the target variant.
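By way of a non-limiting illustration, steps 132 and 142 can be sketched as computing the 3D centers of a cubic voxel grid around a chosen atomic coordinate. The function name `voxel_centers`, its parameters, and the example coordinate are hypothetical and are used here only for illustration; the sketch assumes an odd or even grid resolution with equally sized cubic voxels.

```python
from itertools import product

def voxel_centers(center, grid_size=3, voxel_size=2.0):
    """Return the centers of a grid_size^3 voxel grid centered on `center`.

    `center` stands in for the 3D coordinate of the alpha-carbon atom of the
    reference amino acid experiencing the target variant. `voxel_size` is the
    voxel edge length in angstroms. All names are illustrative assumptions.
    """
    half = (grid_size - 1) / 2.0
    # Offsets of voxel centers from the grid center along one axis.
    offsets = [(i - half) * voxel_size for i in range(grid_size)]
    cx, cy, cz = center
    return [(cx + dx, cy + dy, cz + dz)
            for dx, dy, dz in product(offsets, repeat=3)]

# A 3x3x3 grid of two-angstrom voxels around a made-up coordinate.
centers = voxel_centers((10.0, 5.0, -3.0))
```

With an odd resolution such as 3×3×3, the middle voxel center coincides with the coordinate on which the grid is centered.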

Distance Channels

The voxels in the voxel grid can have a plurality of channels (or features). In one implementation, the voxels in the voxel grid have a plurality of distance channels (e.g., twenty-one distance channels for the twenty-one amino acid categories, respectively (including stop or gap amino acid category)). At step 152, a distance channel generator 154 of the system generates amino acid-wise distance channels for the voxels in the voxel grid. The distance channels are independently generated for each of the twenty-one amino acid categories.

Consider, for example, the Alanine (A) amino acid category. Further consider, for example, that the voxel grid is of size 3×3×3 and has twenty-seven voxels. Then, in one implementation, an Alanine distance channel includes twenty-seven distance values for the twenty-seven voxels in the voxel grid, respectively. The twenty-seven distance values in the Alanine distance channel are measured from respective centers of the twenty-seven voxels in the voxel grid to respective nearest atoms in the Alanine amino acid category.

In one example, the Alanine amino acid category includes only alpha-carbon atoms and therefore the nearest atoms are those Alanine alpha-carbon atoms that are most proximate to the twenty-seven voxels in the voxel grid, respectively. In another example, the Alanine amino acid category includes only beta-carbon atoms and therefore the nearest atoms are those Alanine beta-carbon atoms that are most proximate to the twenty-seven voxels in the voxel grid, respectively.

In yet another example, the Alanine amino acid category includes only oxygen atoms and therefore the nearest atoms are those Alanine oxygen atoms that are most proximate to the twenty-seven voxels in the voxel grid, respectively. In yet another example, the Alanine amino acid category includes only nitrogen atoms and therefore the nearest atoms are those Alanine nitrogen atoms that are most proximate to the twenty-seven voxels in the voxel grid, respectively. In yet another example, the Alanine amino acid category includes only hydrogen atoms and therefore the nearest atoms are those Alanine hydrogen atoms that are most proximate to the twenty-seven voxels in the voxel grid, respectively.

Like the Alanine distance channel, the distance channel generator 154 generates a distance channel (i.e., a set of voxel-wise distance values) for each of the remaining amino acid categories. In other implementations, the distance channel generator 154 generates distance channels only for a subset of the twenty-one amino acid categories.
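The per-category distance channel generation described above can be sketched, under stated assumptions, as follows. The function names, the dictionary layout keyed by one-letter amino acid codes, and the use of infinity for a category with no atoms are all illustrative assumptions rather than details of the disclosed distance channel generator 154.

```python
import math

def distance_channel(centers, atom_coords, missing=float("inf")):
    """One distance value per voxel: distance to the nearest atom in a category.

    `missing` is an assumed placeholder for categories that have no atoms.
    """
    return [min((math.dist(c, a) for a in atom_coords), default=missing)
            for c in centers]

def distance_channels(centers, atoms_by_category):
    """Build one channel per amino acid category, each computed independently."""
    return {category: distance_channel(centers, coords)
            for category, coords in atoms_by_category.items()}

# Toy data: two voxel centers and alpha-carbon coordinates for two categories.
centers = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
atoms_by_category = {
    "A": [(3.0, 0.0, 0.0), (0.0, 4.0, 0.0)],  # Alanine
    "R": [(2.0, 0.0, 1.0)],                   # Arginine
}
channels = distance_channels(centers, atoms_by_category)
```

Restricting `atoms_by_category` to fewer keys corresponds to generating distance channels only for a subset of the amino acid categories.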

In other implementations, the selection of the nearest atoms is not confined to a particular atom type. That is, within a subject amino acid category, the nearest atom to a particular voxel is selected, irrespective of the atomic element of the nearest atom, and the distance value for the particular voxel is calculated for inclusion in the distance channel for the subject amino acid category.

In yet other implementations, the distance channels are generated on an atomic element-basis. Instead of or in addition to having the distance channels for the amino acid categories, distance values can be generated for atomic element categories, irrespective of the amino acids to which the atoms belong. Consider, for example, that the atoms of amino acids in the reference amino acid sequence span seven atomic elements: carbon, oxygen, nitrogen, hydrogen, calcium, iodine, and sulfur. Then, the voxels in the voxel grid are configured to have seven distance channels, such that each of the seven distance channels has twenty-seven voxel-wise distance values that specify distances to nearest atoms only within a corresponding atomic element category. In other implementations, distance channels for only a subset of the seven atomic elements can be generated. In yet other implementations, the atomic element categories and the distance channel generation can be further stratified into variations of a same atomic element, for example, alpha-carbon (C_(α)) atoms and beta-carbon (C_(β)) atoms.

In yet other implementations, the distance channels can be generated on an atom type-basis, for example, distance channels only for side chain atoms and distance channels only for backbone atoms.

The nearest atoms can be searched within a predefined maximum scan radius from the voxel centers (e.g., six angstrom (Å)). Also, multiple atoms can be nearest to a same voxel in the voxel grid.

The distances are calculated between 3D coordinates of the voxel centers and 3D atomic coordinates of the atoms. Also, the distance channels are generated with the voxel grid centered at a same location (e.g., centered at the 3D atomic coordinate of the alpha-carbon atom of the reference amino acid experiencing the target variant).

The distances can be Euclidean distances. Also, the distances can be parameterized by atom size (or atom influence) (e.g., by using Lennard-Jones potential and/or Van der Waals atom radius of the atom in question). Also, the distance values can be normalized by the maximum scan radius, or by a maximum observed distance value of the furthest nearest atom within a subject amino acid category or a subject atomic element category or a subject atom type category. In some implementations, the distances between the voxels and the atoms are calculated based on polar coordinates of the voxels and the atoms. The polar coordinates are parameterized by angles between the voxels and the atoms. In one implementation, this angle information is used to generate an angle channel for the voxels (i.e., independent of the distance channels). In some implementations, angles between a nearest atom and neighboring atoms (e.g., backbone atoms) can be used as features that are encoded with the voxels.
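One possible reading of the scan radius and normalization options above is sketched below: the nearest atom is searched only within the maximum scan radius, and the resulting Euclidean distance is divided by that radius. The function name and the choice to assign the scan radius itself (normalized value 1.0) when no atom lies within range are assumptions made for this sketch; the text above does not specify the out-of-range behavior.

```python
import math

def normalized_nearest_distance(voxel_center, atom_coords, max_scan_radius=6.0):
    """Nearest-atom Euclidean distance within the scan radius, scaled to [0, 1].

    Assumption: a voxel with no atom inside the scan radius receives 1.0.
    """
    in_range = [d for d in (math.dist(voxel_center, a) for a in atom_coords)
                if d <= max_scan_radius]
    nearest = min(in_range, default=max_scan_radius)
    return nearest / max_scan_radius

origin = (0.0, 0.0, 0.0)
# One atom at three angstroms (inside the radius), one at ten (outside).
near = normalized_nearest_distance(origin, [(3.0, 0.0, 0.0), (0.0, 10.0, 0.0)])
far = normalized_nearest_distance(origin, [(0.0, 10.0, 0.0)])
```

Normalizing by a maximum observed nearest-atom distance within a category, as also mentioned above, would replace the divisor with that observed maximum.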

Reference Allele and Alternative Allele Channels

The voxels in the voxel grid can also have reference allele and alternative allele channels. At step 162, a one-hot encoder 164 of the system generates a reference one-hot encoding of a reference amino acid in the reference amino acid sequence and an alternative one-hot encoding of an alternative amino acid in an alternative amino acid sequence. The reference amino acid experiences the target variant. The alternative amino acid is the target variant. The reference amino acid and the alternative amino acid are located at a same position respectively in the reference amino acid sequence and the alternative amino acid sequence. The reference amino acid sequence and the alternative amino acid sequence have the same position-wise amino acid composition with one exception. The exception is the position that has the reference amino acid in the reference amino acid sequence and the alternative amino acid in the alternative amino acid sequence.

At step 172, a concatenator 174 of the system concatenates the amino acid-wise distance channels and the reference and alternative one-hot encodings. In another implementation, the concatenator 174 concatenates the atomic element-wise distance channels and the reference and alternative one-hot encodings. In yet another implementation, the concatenator 174 concatenates the atomic type-wise distance channels and the reference and alternative one-hot encodings.

At step 182, runtime logic 184 of the system processes the concatenated amino acid-wise/atomic element-wise/atomic type-wise distance channels and the reference and alternative one-hot encodings through a pathogenicity classifier (pathogenicity determination engine) to determine a pathogenicity of the target variant, which is in turn inferred as a pathogenicity determination of the underlying nucleotide variant that creates the target variant at the amino acid level. The pathogenicity classifier is trained using labelled datasets of benign and pathogenic variants, for example, using the backpropagation algorithm. Additional details about the labelled datasets of benign and pathogenic variants and example architectures and training of the pathogenicity classifier can be found in commonly owned U.S. patent application Ser. Nos. 16/160,903; 16/160,986; 16/160,968; and 16/407,149.

FIG. 2 schematically illustrates a reference amino acid sequence 202 of a protein 200 and an alternative amino acid sequence 212 of the protein 200. The protein 200 comprises N amino acids. Positions of the amino acids in the protein 200 are labelled 1, 2, 3, . . . , N. In the illustrated example, position 16 is the location that experiences an amino acid variant 214 (mutation) caused by an underlying nucleotide variant. For example, for the reference amino acid sequence 202, position 1 has reference amino acid Phenylalanine (F), position 16 has reference amino acid Glycine (G) 204, and position N (e.g., the last amino acid of the sequence 202) has reference amino acid Leucine (L). Though not illustrated for clarity, remaining positions in the reference amino acid sequence 202 contain various amino acids in an order that is specific to the protein 200. The alternative amino acid sequence 212 is the same as the reference amino acid sequence 202 except for the variant 214 at position 16, which contains the alternative amino acid Alanine (A) 214 instead of the reference amino acid Glycine (G) 204.
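The relationship between the reference and alternative amino acid sequences can be illustrated with a short sketch that locates the single differing position. The function name and the twenty-residue toy sequence are hypothetical; only positions 1 (F), 16 (G to A), and the last position (L) follow the example of FIG. 2, and the filler residues are made up.

```python
def variant_positions(reference, alternative):
    """Return (1-based position, reference amino acid, alternative amino acid)
    for every position at which two equal-length sequences differ."""
    if len(reference) != len(alternative):
        raise ValueError("sequences must have the same length")
    return [(i + 1, ref, alt)
            for i, (ref, alt) in enumerate(zip(reference, alternative))
            if ref != alt]

# Toy sequence: F at position 1, G at position 16, L at the last position.
reference = "F" + "A" * 14 + "G" + "A" * 3 + "L"
# The alternative sequence replaces Glycine with Alanine at position 16 only.
alternative = reference[:15] + "A" + reference[16:]
variants = variant_positions(reference, alternative)
```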

FIG. 3 illustrates amino acid-wise classification of atoms of amino acids in the reference amino acid sequence 202, also referred to herein as “atom classification 300.” Specific types of amino acids, among the twenty natural amino acids listed in column 302, may repeat in a protein. That is, a particular type of amino acid may occur more than once in a protein. Proteins may also have some undetermined amino acids that are categorized by a twenty-first stop or gap amino acid category. The right column in FIG. 3 contains counts of alpha-carbon (C_(α)) atoms from different amino acids.

Specifically, FIG. 3 shows amino acid-wise classification of alpha-carbon (C_(α)) atoms of the amino acids in the reference amino acid sequence 202. Column 308 of FIG. 3 lists the total number of alpha-carbon atoms observed for the reference amino acid sequence 202 in each of the twenty-one amino acid categories. For example, column 308 lists eleven alpha-carbon atoms observed for the Alanine (A) amino acid category. Since each amino acid has only one alpha-carbon atom, this means that Alanine occurs eleven times in the reference amino acid sequence 202. In another example, Arginine (R) occurs thirty-five times in the reference amino acid sequence 202. The total number of alpha-carbon atoms across the twenty-one amino acid categories is eight hundred and twenty-eight.

FIG. 4 illustrates amino acid-wise attribution of 3D atomic coordinates of the alpha-carbon atoms of the reference amino acid sequence 202 based on the atom classification 300 in FIG. 3. This is referred to herein as “atomic coordinates bucketing 400.” In FIG. 4, lists 404-440 tabulate the 3D atomic coordinates of the alpha-carbon atoms bucketed to each of the twenty-one amino acid categories.
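A minimal sketch of the classification and bucketing of alpha-carbon coordinates is given below. The input layout, a list of (one-letter amino acid code, coordinate) pairs in sequence order, the function name, and the toy coordinates are all assumptions made for illustration.

```python
from collections import defaultdict

def bucket_alpha_carbons(residues):
    """Group alpha-carbon coordinates by amino acid category.

    `residues` is an iterable of (amino_acid, (x, y, z)) pairs, one pair per
    residue, since each amino acid has exactly one alpha-carbon atom.
    """
    buckets = defaultdict(list)
    for amino_acid, coord in residues:
        buckets[amino_acid].append(coord)
    return dict(buckets)

# Toy residues with made-up coordinates.
residues = [
    ("A", (1.0, 2.0, 3.0)),
    ("R", (4.0, 5.0, 6.0)),
    ("A", (7.0, 8.0, 9.0)),
    ("G", (0.0, 0.0, 0.0)),
]
buckets = bucket_alpha_carbons(residues)
# Per-category counts, analogous to the counts of the atom classification.
counts = {amino_acid: len(coords) for amino_acid, coords in buckets.items()}
```

In this sketch, the length of each bucket equals the number of times the corresponding amino acid occurs in the sequence.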

In the illustrated implementation, the bucketing 400 in FIG. 4 follows the classification 300 of FIG. 3. For example, in FIG. 3, the Alanine amino acid category has eleven alpha-carbon atoms, and therefore, in FIG. 4, the Alanine amino acid category has eleven 3D atomic coordinates of the corresponding eleven alpha-carbon atoms from FIG. 3. This classification-to-bucketing logic flows from FIG. 3 to FIG. 4 for other amino acid categories too. However, this classification-to-bucketing logic is only for representational purposes, and, in other implementations, the technology disclosed need not perform the classification 300 and the bucketing 400 to locate the voxel-wise nearest atoms, and may perform fewer, additional, or different steps. For example, in some implementations, the technology disclosed can locate the voxel-wise nearest atoms by using a sort and search algorithm that returns the voxel-wise nearest atoms from one or more databases in response to a search query configured to accept query parameters like sort criteria (e.g., amino acid-wise, atomic element-wise, atom type-wise), the predefined maximum scan radius, and the type of distances (e.g., Euclidean, Mahalanobis, normalized, unnormalized). In various implementations of the technology disclosed, a plurality of sort and search algorithms from the current or future technical field can be analogously used by a person skilled in the art to locate the voxel-wise nearest atoms.

In FIG. 4, the 3D atomic coordinates are represented by Cartesian coordinates x, y, z, but any type of coordinate system may be used, such as spherical or cylindrical coordinates, and claimed subject matter is not limited in this respect. In some implementations, one or more databases may include information regarding the 3D atomic coordinates of the alpha-carbon atoms and other atoms of amino acids in proteins. Such databases may be searchable by specific proteins.

As discussed above, the voxels and the voxel grid are 3D entities. However, for clarity's sake, the drawings depict, and the description discusses, the voxels and the voxel grid in a two-dimensional (2D) format. For example, a 3×3×3 voxel grid of twenty-seven voxels is depicted and described herein as a 3×3 2D pixel grid with nine 2D pixels. A person skilled in the art will appreciate that the 2D format is used only for representational purposes and is intended to cover the 3D counterparts (i.e., 2D pixels represent 3D voxels and the 2D pixel grid represents the 3D voxel grid). Also, the drawings are not to scale. For example, voxels of size two angstrom (Å) are depicted using a single pixel.

Voxel-Wise Distance Calculation

FIG. 5 schematically illustrates a process of determining voxel-wise distance values, also referred to herein as “voxel-wise distance calculation 500.” In the illustrated example, the voxel-wise distance values are calculated only for the Alanine (A) distance channel. However, the same distance calculation logic is executed for each of the twenty-one amino acid categories to generate twenty-one amino acid-wise distance channels and can be further expanded to other atom types like beta-carbon atoms and other atomic elements like oxygen, nitrogen, and hydrogen, as discussed above with respect to FIG. 1. In some implementations, the atoms are randomly rotated prior to the distance calculation to make the training of the pathogenicity classifier invariant to atom orientation.

In FIG. 5, a voxel grid 522 has nine voxels 514 identified with indices (1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3), (3, 1), (3, 2), and (3, 3). The voxel grid 522 is centered, for example, at the 3D atomic coordinate 532 of the alpha-carbon atom of the Glycine (G) amino acid at position 16 in the reference amino acid sequence 202 because, in the alternative amino acid sequence 212, the position 16 experiences the variant that mutates the Glycine (G) amino acid to the Alanine (A) amino acid, as discussed above with respect to FIG. 2. Also, the center of the voxel grid 522 coincides with the center of voxel (2, 2).

The centered voxel grid 522 is used for the voxel-wise distance calculation for each of the twenty-one amino acid-wise distance channels. Starting, for example, with the Alanine (A) distance channel, distances between the 3D coordinates of respective centers of the nine voxels 514 and the 3D atomic coordinates 402 of the eleven Alanine alpha-carbon atoms are measured to locate a nearest Alanine alpha-carbon atom for each of the nine voxels 514. Then, nine distance values for nine distances between the nine voxels 514 and the respective nearest Alanine alpha-carbon atoms are used to construct the Alanine distance channel. The resulting Alanine distance channel arranges the nine Alanine distance values in the same order as the nine voxels 514 in the voxel grid 522.

The above process is executed for each of the twenty-one amino acid categories. For example, the centered voxel grid 522 is similarly used to calculate the Arginine (R) distance channel, such that distances between the 3D coordinates of respective centers of the nine voxels 514 and the 3D atomic coordinates 404 of the thirty-five Arginine alpha-carbon atoms are measured to locate a nearest Arginine alpha-carbon atom for each of the nine voxels 514. Then, nine distance values for nine distances between the nine voxels 514 and the respective nearest Arginine alpha-carbon atoms are used to construct the Arginine distance channel. The resulting Arginine distance channel arranges the nine Arginine distance values in the same order as the nine voxels 514 in the voxel grid 522. The twenty-one amino acid-wise distance channels are voxel-wise encoded to form a distance channel tensor.

Specifically, in the illustrated example, a distance 512 is between the center of voxel (1, 1) of voxel grid 522 and the nearest alpha-carbon (C_(α)) atom, which is the Cα^(A5) atom in list 402. Accordingly, the value assigned to voxel (1, 1) is the distance 512. In another example, the Cα^(A4) atom is the nearest C_(α) atom to the center of voxel (1, 2). Accordingly, the value assigned to voxel (1, 2) is the distance between the center of voxel (1, 2) and the Cα^(A4) atom. In still another example, the Cα^(A6) atom is the nearest C_(α) atom to the center of voxel (2, 1). Accordingly, the value assigned to voxel (2, 1) is the distance between the center of voxel (2, 1) and the Cα^(A6) atom. In still another example, the Cα^(A6) atom is also the nearest C_(α) atom to the center of voxels (3, 2) and (3, 3). Accordingly, the value assigned to voxel (3, 2) is the distance between the center of voxel (3, 2) and the Cα^(A6) atom and the value assigned to voxel (3, 3) is the distance between the center of voxel (3, 3) and the Cα^(A6) atom. In some implementations, the distance values assigned to the voxels 514 may be normalized distances. For example, the distance value assigned to voxel (1, 1) may be the distance 512 divided by a maximum distance 502 (predefined maximum scan radius). In some implementations, the nearest-atom distances may be Euclidean distances and the nearest-atom distances may be normalized by dividing the Euclidean distances by a maximum nearest-atom distance (e.g., such as the maximum distance 502).
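The voxel-wise nearest-atom lookup described above, including the case in which a same atom is nearest to more than one voxel, can be sketched in the 2D format used by the drawings. The atom labels `CA_A5` and `CA_A6`, their coordinates, and the placement of voxel (row, column) at the point (row, column) are made up for this sketch and do not reproduce the geometry of FIG. 5.

```python
import math

def nearest_atoms(voxel_centers, atoms):
    """Map each voxel index to (label of its nearest atom, distance to it)."""
    result = {}
    for index, center in voxel_centers.items():
        label = min(atoms, key=lambda name: math.dist(center, atoms[name]))
        result[index] = (label, math.dist(center, atoms[label]))
    return result

# A 3x3 grid whose voxel (row, col) has its center at the point (row, col).
voxel_centers = {(r, c): (float(r), float(c))
                 for r in range(1, 4) for c in range(1, 4)}
# Two hypothetical Alanine alpha-carbon atoms.
atoms = {"CA_A5": (0.5, 0.5), "CA_A6": (3.5, 2.5)}
nearest = nearest_atoms(voxel_centers, atoms)
```

Here `CA_A6` is the nearest atom for both voxel (3, 2) and voxel (3, 3), while voxel (1, 1) is assigned its distance to `CA_A5`.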

As described above, for amino acids having alpha-carbon atoms, the distances may be nearest-alpha-carbon atom distances from corresponding voxel centers to nearest alpha-carbon atoms of the corresponding amino acids. Additionally, for amino acids having beta-carbon atoms, the distances may be nearest-beta-carbon atom distances from corresponding voxel centers to nearest beta-carbon atoms of the corresponding amino acids. Similarly, for amino acids having backbone atoms, the distances may be nearest-backbone atom distances from corresponding voxel centers to nearest backbone atoms of the corresponding amino acids. Similarly, for amino acids having sidechain atoms, the distances may be nearest-sidechain atom distances from corresponding voxel centers to nearest sidechain atoms of the corresponding amino acids. In some implementations, the distances additionally/alternatively can include distances to second, third, fourth nearest atoms, and so on.

Amino Acid-Wise Distance Channels

FIG. 6 shows an example of twenty-one amino acid-wise distance channels 600. Each column in FIG. 6 corresponds to a respective one of the twenty-one amino acid-wise distance channels 602-642. Each amino acid-wise distance channel comprises a distance value for each of the voxels 514 of the voxel grid 522. For example, the amino acid-wise distance channel 602 for Alanine (A) comprises distance values for respective ones of the voxels 514 of the voxel grid 522. As mentioned above, the voxel grid 522 is a 3D grid of volume 3×3×3 and comprises twenty-seven voxels. Likewise, though FIG. 6 illustrates the voxels 514 in two dimensions (e.g., nine voxels of a 3×3 grid), each amino acid-wise distance channel may comprise twenty-seven voxel-wise distance values for the 3×3×3 voxel grid.

Directionality Encoding

In some implementations, the technology disclosed uses a directionality parameter to specify the directionality of the reference amino acids in the reference amino acid sequence 202. In some implementations, the technology disclosed uses the directionality parameter to specify the directionality of the alternative amino acids in the alternative amino acid sequence 212. In some implementations, the technology disclosed uses the directionality parameter to specify the position in the protein 200 that experiences the target variant at the amino acid level.

As discussed above, all the distance values in the twenty-one amino acid-wise distance channels 602-642 are measured from respective nearest atoms to the voxels 514 in the voxel grid 522. These nearest atoms originate from one of the reference amino acids in the reference amino acid sequence 202. These originating reference amino acids, which contain the nearest atoms, can be classified into two categories: (1) those originating reference amino acids that precede the variant-experiencing reference amino acid 204 in the reference amino acid sequence 202 and (2) those originating reference amino acids that succeed the variant-experiencing reference amino acid 204 in the reference amino acid sequence 202. The originating reference amino acids in the first category can be called preceding reference amino acids. The originating reference amino acids in the second category can be called succeeding reference amino acids.

The directionality parameter is applied to those distance values in the twenty-one amino acid-wise distance channels 602-642 that are measured from those nearest atoms that originate from the preceding reference amino acids. In one implementation, the directionality parameter is multiplied with such distance values. The directionality parameter can be any number, such as −1.
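
By way of a non-limiting illustration, the application of the directionality parameter can be sketched as follows. The function name, the argument names, and the choice to leave distances from the variant-experiencing residue itself unchanged are assumptions made for this sketch and are not taken from the disclosure; only the multiplication of the selected distance values by a parameter such as −1 is.

```python
# Illustrative sketch only; names and data layout are hypothetical.
DIRECTIONALITY_PARAMETER = -1.0  # the text gives -1 as one possible value


def apply_directionality(distances, origin_positions, variant_position,
                         parameter=DIRECTIONALITY_PARAMETER):
    """Scale distances whose nearest atom belongs to a preceding residue.

    distances[i] is a nearest-atom distance and origin_positions[i] is the
    sequence position of the residue that contains that nearest atom.
    """
    encoded = []
    for distance, position in zip(distances, origin_positions):
        if position < variant_position:
            # Nearest atom originates from a preceding reference amino acid.
            encoded.append(distance * parameter)
        else:
            # Succeeding residue or, by assumption, the variant residue itself.
            encoded.append(distance)
    return encoded


encoded = apply_directionality([0.2, 0.5, 0.9], [10, 15, 22], variant_position=15)
print(encoded)  # [-0.2, 0.5, 0.9]
```

With a parameter of −1, the sign of a distance value tells the classifier on which side of the variant position the contributing residue lies.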

As a result of the application of the directionality parameter, the twenty-one amino acid-wise distance channels 600 include some distance values that indicate to the pathogenicity classifier which end of the protein 200 is the start terminal and which end is the end terminal. This also allows the pathogenicity classifier to reconstruct a protein sequence from the 3D protein structure information supplied by the distance channels and the reference and allele channels.

Distance Channel Tensor

FIG. 7 is a schematic diagram of a distance channel tensor 700. Distance channel tensor 700 is a voxelized representation of the amino acid-wise distance channels 600 from FIG. 6. In the distance channel tensor 700, the twenty-one amino acid-wise distance channels 602-642 are concatenated voxel-wise, like RGB channels of a color image. The voxelized dimensionality of the distance channel tensor 700 is 21×3×3×3 (where 21 denotes the twenty-one amino acid categories and 3×3×3 denotes the 3D voxel grid with twenty-seven voxels); although FIG. 7 is a 2D depiction of dimensionality 21×3×3.
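
The following minimal sketch shows one way a 21×3×3×3 distance channel tensor could be assembled from atom coordinates. It is illustrative only: the one-letter alphabet (with "X" as a placeholder twenty-first category), the voxel edge length, the fill value used when a category has no atom, and all function names are assumptions made for this example.

```python
import math

# 20 standard amino acids plus one extra category; "X" is an assumed placeholder.
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY") + ["X"]
GRID = 3            # 3x3x3 voxel grid, as in the text
VOXEL_SIZE = 2.0    # assumed voxel edge length
FILL = 6.0          # assumed value when a category has no atom


def voxel_centers(grid=GRID, size=VOXEL_SIZE):
    """Return voxel centers, indexed [i][j][k], for a grid centered on the origin."""
    offset = (grid - 1) / 2.0
    return [[[((i - offset) * size, (j - offset) * size, (k - offset) * size)
              for k in range(grid)] for j in range(grid)] for i in range(grid)]


def distance_channel_tensor(atoms, fill=FILL):
    """Build a nested list of shape [21][3][3][3].

    atoms is a list of (amino_acid_letter, (x, y, z)) tuples. Each channel
    holds, per voxel, the distance from the voxel center to the nearest atom
    of that amino acid category.
    """
    centers = voxel_centers()
    tensor = []
    for amino_acid in AMINO_ACIDS:
        coords = [xyz for letter, xyz in atoms if letter == amino_acid]
        channel = [[[min((math.dist(center, xyz) for xyz in coords), default=fill)
                     for center in row] for row in plane] for plane in centers]
        tensor.append(channel)
    return tensor


atoms = [("A", (0.0, 0.0, 0.0)), ("G", (2.0, 0.0, 0.0))]
tensor = distance_channel_tensor(atoms)
dims = (len(tensor), len(tensor[0]), len(tensor[0][0]), len(tensor[0][0][0]))
print(dims)  # (21, 3, 3, 3)
```

The leading axis plays the role of the color channels in the RGB analogy: one channel per amino acid category, each covering the same twenty-seven voxels.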

One-Hot Encodings

FIG. 8 shows one-hot encodings 800 of the reference amino acid 204 and the alternative amino acid 214. In FIG. 8, the left column is a one-hot encoding 802 of the reference amino acid Glycine (G) 204, with one for the Glycine amino acid category and zeros for all other amino acid categories. In FIG. 8, the right column is a one-hot encoding 804 of the variant/alternative amino acid Alanine (A) 214, with one for the Alanine amino acid category and zeros for all other amino acid categories.
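
As a simple illustration, a one-hot encoding over twenty-one categories can be produced as sketched below. The ordering of the categories and the placeholder symbol "X" for the twenty-first category are assumptions of this example.

```python
# Assumed category order; "X" stands in for the twenty-first category.
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY") + ["X"]


def one_hot(amino_acid):
    """Return a 21-element list with a single one at the given category."""
    if amino_acid not in AMINO_ACIDS:
        raise ValueError(f"unknown amino acid category: {amino_acid!r}")
    return [1 if letter == amino_acid else 0 for letter in AMINO_ACIDS]


reference_encoding = one_hot("G")    # reference amino acid Glycine
alternative_encoding = one_hot("A")  # alternative amino acid Alanine
print(reference_encoding.index(1), alternative_encoding.index(1))  # 5 0
```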

FIG. 9 is a schematic diagram of a voxelized one-hot encoded reference amino acid 902 and a voxelized one-hot encoded variant/alternative amino acid 912. The voxelized one-hot encoded reference amino acid 902 is a voxelized representation of the one-hot encoding 802 of the reference amino acid Glycine (G) 204 from FIG. 8. The voxelized one-hot encoded alternative amino acid 912 is a voxelized representation of the one-hot encoding 804 of the variant/alternative amino acid Alanine (A) 214 from FIG. 8. The voxelized dimensionality of the voxelized one-hot encoded reference amino acid 902 is 21×1×1×1 (where 21 denotes the twenty-one amino acid categories); although FIG. 9 is a 2D depiction of dimensionality 21×1×1. Similarly, the voxelized dimensionality of the voxelized one-hot encoded alternative amino acid 912 is 21×1×1×1 (where 21 denotes the twenty-one amino acid categories); although FIG. 9 is a 2D depiction of dimensionality 21×1×1.

Reference Allele Tensor

FIG. 10 schematically illustrates a concatenation process 1000 that voxel-wise concatenates the distance channel tensor 700 of FIG. 7 and a reference allele tensor 1004. The reference allele tensor 1004 is a voxel-wise aggregation (repetition/cloning/replication) of the voxelized one-hot encoded reference amino acid 902 from FIG. 9. That is, multiple copies of the voxelized one-hot encoded reference amino acid 902 are voxel-wise concatenated with each other according to the spatial arrangement of the voxels 514 in the voxel grid 512, such that the reference allele tensor 1004 has a corresponding copy of the voxelized one-hot encoded reference amino acid 902 for each of the voxels 514 in the voxel grid 512.

The concatenation process 1000 produces a concatenated tensor 1010. The voxelized dimensionality of the reference allele tensor 1004 is 21×3×3×3 (where 21 denotes the twenty-one amino acid categories and 3×3×3 denotes the 3D voxel grid with twenty-seven voxels); although FIG. 10 is a 2D depiction of the reference allele tensor 1004 having dimensionality 21×3×3. The voxelized dimensionality of the concatenated tensor 1010 is 42×3×3×3; although FIG. 10 is a 2D depiction of the concatenated tensor 1010 having dimensionality 42×3×3.

Alternative Allele Tensor

FIG. 11 schematically illustrates a concatenation process 1100 that voxel-wise concatenates the distance channel tensor 700 of FIG. 7, the reference allele tensor 1004 of FIG. 10, and an alternative allele tensor 1104. The alternative allele tensor 1104 is a voxel-wise aggregation (repetition/cloning/replication) of the voxelized one-hot encoded alternative amino acid 912 from FIG. 9. That is, multiple copies of the voxelized one-hot encoded alternative amino acid 912 are voxel-wise concatenated with each other according to the spatial arrangement of the voxels 514 in the voxel grid 512, such that the alternative allele tensor 1104 has a corresponding copy of the voxelized one-hot encoded alternative amino acid 912 for each of the voxels 514 in the voxel grid 512.

The concatenation process 1100 produces a concatenated tensor 1110. The voxelized dimensionality of the alternative allele tensor 1104 is 21×3×3×3 (where 21 denotes the twenty-one amino acid categories and 3×3×3 denotes the 3D voxel grid with twenty-seven voxels); although FIG. 11 is a 2D depiction of the alternative allele tensor 1104 having dimensionality 21×3×3. The voxelized dimensionality of the concatenated tensor 1110 is 63×3×3×3; although FIG. 11 is a 2D depiction of the concatenated tensor 1110 having dimensionality 63×3×3.
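
The replication of a one-hot encoding over the voxel grid and the subsequent voxel-wise concatenation can be sketched as follows. This is an illustrative example under stated assumptions: the helper names are hypothetical, tensors are represented as nested lists with the channel axis first, the distance values are placeholders, and the one-hot indices for Glycine and Alanine follow an assumed category order.

```python
def replicate_over_grid(vector, grid=3):
    """Copy each entry of a length-C vector to every voxel: shape [C][grid][grid][grid]."""
    return [[[[value] * grid for _ in range(grid)] for _ in range(grid)]
            for value in vector]


def concatenate_channels(*tensors):
    """Concatenate tensors along the leading (channel) axis."""
    combined = []
    for tensor in tensors:
        combined.extend(tensor)
    return combined


def shape(tensor):
    """Return the dimensions of a regular nested list."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        tensor = tensor[0]
    return tuple(dims)


reference_one_hot = [0] * 21
reference_one_hot[5] = 1      # hypothetical index for Glycine (G)
alternative_one_hot = [0] * 21
alternative_one_hot[0] = 1    # hypothetical index for Alanine (A)

distance_tensor = replicate_over_grid([0.0] * 21)  # placeholder distance values
reference_tensor = replicate_over_grid(reference_one_hot)
alternative_tensor = replicate_over_grid(alternative_one_hot)

concatenated = concatenate_channels(distance_tensor, reference_tensor, alternative_tensor)
print(shape(reference_tensor), shape(concatenated))  # (21, 3, 3, 3) (63, 3, 3, 3)
```

Because every voxel receives the same copy of the allele encoding, the allele channels are spatially constant, whereas the distance channels vary from voxel to voxel.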

In some implementations, the runtime logic 184 processes the concatenated tensor 1110 through the pathogenicity classifier to determine a pathogenicity of the variant/alternative amino acid Alanine (A) 214, which is in turn inferred as a pathogenicity determination of the underlying nucleotide variant that creates the variant/alternative amino acid Alanine (A) 214.

Evolutionary Conservation Channels

Predicting the functional consequences of variants relies at least in part on the assumption that crucial amino acids for protein families are conserved through evolution due to negative selection (i.e., amino acid changes at these sites were deleterious in the past), and that mutations at these sites have an increased likelihood of being pathogenic (causing disease) in humans. In general, homologous sequences of a target protein are collected and aligned, and a metric of conservation is computed based on the weighted frequencies of different amino acids observed in the target position in the alignment.

Accordingly, the technology disclosed concatenates the distance channel tensor 700, the reference allele tensor 1004, and the alternative allele tensor 1104 with evolutionary channels. One example of the evolutionary channels is pan-amino acid conservation frequencies. Another example of the evolutionary channels is per-amino acid conservation frequencies.

In some implementations, the evolutionary channels are constructed using position weight matrices (PWMs). In other implementations, the evolutionary channels are constructed using position specific frequency matrices (PSFMs). In yet other implementations, the evolutionary channels are constructed using computational tools like SIFT, PolyPhen, and PANTHER-PSEC. In yet other implementations, the evolutionary channels are preservation channels based on evolutionary preservation. Preservation is related to conservation, as it also reflects the effect of negative selection that has acted to prevent evolutionary change at a given site in a protein.

Pan-Amino Acid Evolutionary Profiles

FIG. 12 is a flow diagram that illustrates a process 1200 of a system for determining and assigning pan-amino acid conservation frequencies of nearest atoms to voxels (voxelizing), in accordance with one implementation of the technology disclosed. FIGS. 12, 13, 14, 15, 16, 17, and 18 are discussed in tandem.

At step 1202, a similar sequence finder 1204 of the system retrieves amino acid sequences that are similar (homologous) to the reference amino acid sequence 202. The similar amino acid sequences can be selected from multiple species like primates, mammals, and vertebrates.

At step 1212, an aligner 1214 of the system position-wise aligns the reference amino acid sequence 202 with the similar amino acid sequences, i.e., the aligner 1214 performs a multi-sequence alignment. FIG. 14 shows an example multi-sequence alignment 1400 of the reference amino acid sequence 202 across ninety-nine species. In some implementations, the multi-sequence alignment 1400 can be partitioned, for example, to generate a first position frequency matrix 1402 for primates, a second position frequency matrix 1412 for mammals, and a third position frequency matrix 1422 for vertebrates. In other implementations, a single position frequency matrix is generated across the ninety-nine species.

At step 1222, a pan-amino acid conservation frequency calculator 1224 of the system uses the multi-sequence alignment to determine pan-amino acid conservation frequencies of the reference amino acids in the reference amino acid sequence 202.

At step 1232, a nearest atom finder 1234 of the system finds nearest atoms to the voxels 514 in the voxel grid 512. In some implementations, the search for the voxel-wise nearest atoms may not be confined to any particular amino acid category or atom type. That is, the voxel-wise nearest atoms can be selected across the amino acid categories and the atom types, as long as they are the most proximate atoms to the respective voxel centers. In other implementations, the search for the voxel-wise nearest atoms may be confined to only a particular atom category, such as only to a particular atomic element like oxygen, nitrogen, or hydrogen, or only to alpha-carbon atoms, or only to beta-carbon atoms, or only to sidechain atoms, or only to backbone atoms.

At step 1242, an amino acid selector 1244 of the system selects those reference amino acids in the reference amino acid sequence 202 that contain the nearest atoms identified at the step 1232. Such reference amino acids can be called nearest reference amino acids. FIG. 13 shows an example of locating nearest atoms 1302 to the voxels 514 in the voxel grid 512 and respectively mapping nearest reference amino acids 1312 that contain the nearest atoms 1302 to the voxels 514 in the voxel grid 512. This is identified in FIG. 13 as “voxels-to-nearest amino acids mapping 1300.”

At step 1252, a voxelizer 1254 of the system voxelizes pan-amino acid conservation frequencies of the nearest reference amino acids. FIG. 15 shows an example of determining a pan-amino acid conservation frequencies sequence for the first voxel (1, 1) in the voxel grid 512, also referred to herein as “per-voxel evolutionary profile determination 1500.”

Turning to FIG. 13, the nearest reference amino acid that was mapped to the first voxel (1, 1) is Aspartic acid (D) amino acid at position 15 in the reference amino acid sequence 202. Then, the multi-sequence alignment of the reference amino acid sequence 202 with, for example, ninety-nine homologous amino acid sequences of the ninety-nine species is analyzed at position 15. Such a position-specific and cross-species analysis reveals how many instances of amino acids from each of the twenty-one amino acid categories are found at position 15 across the hundred aligned amino acid sequences (i.e., the reference amino acid sequence 202 plus the ninety-nine homologous amino acid sequences).

In the example illustrated in FIG. 15, the Aspartic acid (D) amino acid is found at position 15 in ninety-six out of the hundred aligned amino acid sequences. So, the Aspartic acid amino acid category 1504 is assigned a pan-amino acid conservation frequency of 0.96. Similarly, in the illustrated example, the Valine (V) amino acid is found at position 15 in four out of the hundred aligned amino acid sequences. So, the Valine amino acid category 1514 is assigned a pan-amino acid conservation frequency of 0.04. Since no instances of amino acids from other amino acid categories are detected at position 15, the remaining amino acid categories are assigned a pan-amino acid conservation frequency of zero. This way, each of the twenty-one amino acid categories is assigned a respective pan-amino acid conservation frequency, which can be encoded in the pan-amino acid conservation frequencies sequence 1502 for the first voxel (1, 1).
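
The frequency computation of this example can be sketched as below, assuming that an aligned position is available as a list of one-letter residues. The function name, the category order, and the placeholder "X" category are hypothetical; the sketch uses unweighted counts, whereas an implementation may apply sequence weighting.

```python
from collections import Counter

# Assumed category order; "X" stands in for the twenty-first category.
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY") + ["X"]


def conservation_frequencies(alignment_column):
    """Return 21 frequencies, one per amino acid category, for one aligned position."""
    total = len(alignment_column)
    if total == 0:
        raise ValueError("alignment column is empty")
    counts = Counter(alignment_column)
    return [counts[amino_acid] / total for amino_acid in AMINO_ACIDS]


# Position 15 across one hundred aligned sequences, as in the example of FIG. 15.
column = ["D"] * 96 + ["V"] * 4
frequencies = conservation_frequencies(column)
print(frequencies[AMINO_ACIDS.index("D")], frequencies[AMINO_ACIDS.index("V")])  # 0.96 0.04
```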

FIG. 16 shows respective pan-amino acid conservation frequencies 1612-1692 determined for respective ones of the voxels 514 in the voxel grid 512 using the position frequency logic described in FIG. 15, also referred to herein as “voxels-to-evolutionary profiles mapping 1600.”

Per-voxel evolutionary profiles 1602 are then used by the voxelizer 1254 to generate voxelized per-voxel evolutionary profiles 1700, illustrated in FIG. 17. Often, each of the voxels 514 in the voxel grid 512 has a different pan-amino acid conservation frequencies sequence and therefore a different voxelized per-voxel evolutionary profile because the voxels are regularly mapped to different nearest atoms and therefore to different nearest reference amino acids. Of course, when two or more voxels have a same nearest atom and thereby a same nearest reference amino acid, a same pan-amino acid conservation frequencies sequence and a same voxelized per-voxel evolutionary profile are assigned to each of the two or more voxels.

FIG. 18 depicts an example of an evolutionary profiles tensor 1800 in which the voxelized per-voxel evolutionary profiles 1700 are voxel-wise concatenated with each other according to the spatial arrangement of the voxels 514 in the voxel grid 512. The voxelized dimensionality of the evolutionary profiles tensor 1800 is 21×3×3×3 (where 21 denotes the twenty-one amino acid categories and 3×3×3 denotes the 3D voxel grid with twenty-seven voxels); although FIG. 18 is a 2D depiction of the evolutionary profiles tensor 1800 having dimensionality 21×3×3.
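
A minimal sketch of arranging per-voxel profiles into a 21×3×3×3 tensor is given below. It assumes that the voxel-to-nearest-residue mapping and the per-position frequency sequences have already been determined; the function name, the data layout, and the category indices used in the example inputs are hypothetical.

```python
def evolutionary_profiles_tensor(nearest_positions, profiles):
    """Arrange per-voxel profiles into a nested list of shape [21][3][3][3].

    nearest_positions[i][j][k] is the sequence position of the residue nearest
    to voxel (i, j, k); profiles maps a position to its 21 frequencies.
    """
    grid = len(nearest_positions)
    channels = len(next(iter(profiles.values())))
    return [[[[profiles[nearest_positions[i][j][k]][c]
               for k in range(grid)] for j in range(grid)] for i in range(grid)]
            for c in range(channels)]


# Hypothetical inputs: every voxel maps to position 15 except one that maps to 40.
profile_15 = [0.0] * 21
profile_15[2] = 0.96   # assumed index for Aspartic acid (D)
profile_15[17] = 0.04  # assumed index for Valine (V)
profile_40 = [0.0] * 21
profile_40[0] = 1.0    # assumed index for Alanine (A)

nearest_positions = [[[15] * 3 for _ in range(3)] for _ in range(3)]
nearest_positions[2][2][2] = 40
tensor = evolutionary_profiles_tensor(nearest_positions, {15: profile_15, 40: profile_40})
print(len(tensor), tensor[2][0][0][0], tensor[0][2][2][2])  # 21 0.96 1.0
```

Voxels that share a nearest residue receive identical profiles, which matches the case described above in which two or more voxels have a same nearest atom.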

At step 1262, the concatenator 174 voxel-wise concatenates the evolutionary profiles tensor 1800 with the distance channel tensor 700. In some implementations, the evolutionary profiles tensor 1800 is voxel-wise concatenated with the concatenated tensor 1110 to generate a further concatenated tensor of dimensionality 84×3×3×3 (not shown).

At step 1272, the runtime logic 184 processes the further concatenated tensor of dimensionality 84×3×3×3 through the pathogenicity classifier to determine the pathogenicity of the target variant, which is in turn inferred as a pathogenicity determination of the underlying nucleotide variant that creates the target variant at the amino acid level.

Per-Amino Acid Evolutionary Profiles

FIG. 19 is a flow diagram that illustrates a process 1900 of a system for determining and assigning per-amino acid conservation frequencies of nearest atoms to voxels (voxelizing). In FIG. 19, the steps 1202 and 1212 are the same as in FIG. 12.

At step 1922, a per-amino acid conservation frequency calculator 1924 of the system uses the multi-sequence alignment to determine per-amino acid conservation frequencies of the reference amino acids in the reference amino acid sequence 202.

At step 1932, a nearest atom finder 1934 of the system finds, for each of the voxels 514 in the voxel grid 512, twenty-one nearest atoms across each of the twenty-one amino acid categories. Each of the twenty-one nearest atoms is different from each other because they are selected from different amino acid categories. This leads to the selection of twenty-one unique nearest reference amino acids for a particular voxel, which in turn leads to generation of twenty-one unique position frequency matrices for the particular voxel, and which in turn leads to determination of twenty-one unique per-amino acid conservation frequencies for the particular voxel.

At step 1942, an amino acid selector 1944 of the system selects, for each of the voxels 514 in the voxel grid 512, twenty-one reference amino acids in the reference amino acid sequence 202 that contain the twenty-one nearest atoms identified at the step 1932. Such reference amino acids can be called nearest reference amino acids.

At step 1952, a voxelizer 1954 of the system voxelizes per-amino acid conservation frequencies of the twenty-one nearest reference amino acids identified for the particular voxel at the step 1942. The twenty-one nearest reference amino acids are necessarily located at twenty-one different positions in the reference amino acid sequence 202 because they correspond to different underlying nearest atoms. Accordingly, for the particular voxel, twenty-one position frequency matrices can be generated for the twenty-one nearest reference amino acids. The twenty-one position frequency matrices can be generated across multiple species whose homologous amino acid sequences are position-wise aligned with the reference amino acid sequence 202, as discussed above with respect to FIGS. 12 to 15.

Then, using the twenty-one position frequency matrices, twenty-one position-specific conservation scores can be calculated for the twenty-one nearest reference amino acids identified for the particular voxel. These twenty-one position-specific conservation scores form the per-amino acid conservation frequencies for the particular voxel, similar to the pan-amino acid conservation frequencies sequence 1502 in FIG. 15; except the sequence 1502 has many zero entries, whereas each element (feature) in a per-amino acid conservation frequencies sequence has a value (e.g., a floating point number) because the twenty-one nearest reference amino acids across the twenty-one amino acid categories necessarily have different positions that yield different position frequency matrices and thereby different per-amino acid conservation frequencies.
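
For illustration, the per-amino acid computation for a single voxel can be sketched as follows. The sketch assumes that the position-specific conservation score of a nearest reference amino acid is the frequency of that amino acid at its own alignment position; the disclosure does not fix a particular score, so this choice, together with the function name, the data layout, the placeholder "X" category, and the value used when a category has no atom, is an assumption.

```python
import math

# Assumed category order; "X" stands in for the twenty-first category.
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY") + ["X"]


def per_amino_acid_frequencies(voxel_center, atoms, alignment_columns, missing=0.0):
    """Return 21 features for one voxel, one per amino acid category.

    atoms is a list of (amino_acid_letter, sequence_position, (x, y, z)).
    alignment_columns maps a sequence position to the residues aligned there.
    """
    features = []
    for amino_acid in AMINO_ACIDS:
        candidates = [(math.dist(voxel_center, xyz), position)
                      for letter, position, xyz in atoms if letter == amino_acid]
        if not candidates:
            features.append(missing)  # assumed handling of absent categories
            continue
        _, position = min(candidates)  # nearest atom of this category
        column = alignment_columns[position]
        features.append(column.count(amino_acid) / len(column))
    return features


atoms = [("D", 15, (0.5, 0.0, 0.0)), ("D", 40, (3.0, 0.0, 0.0)), ("V", 7, (1.0, 1.0, 0.0))]
alignment_columns = {
    15: ["D"] * 96 + ["V"] * 4,
    40: ["D"] * 50 + ["E"] * 50,
    7: ["V"] * 80 + ["I"] * 20,
}
features = per_amino_acid_frequencies((0.0, 0.0, 0.0), atoms, alignment_columns)
print(features[AMINO_ACIDS.index("D")], features[AMINO_ACIDS.index("V")])  # 0.96 0.8
```

In this example each feature is drawn from a different alignment position, which is the distinction from the pan-amino acid case, where all twenty-one frequencies of a voxel come from a single position.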

The above process is executed for each of the voxels 514 in the voxel grid 512, and the resulting voxel-wise per-amino acid conservation frequencies are voxelized, tensorized, concatenated, and processed for pathogenicity determination similar to the pan-amino acid conservation frequencies discussed with respect to FIGS. 12 to 18.

Annotation Channels

FIG. 20 shows various examples of voxelized annotation channels 2000 that are concatenated with the distance channel tensor 700. In some implementations, the voxelized annotation channels are one-hot indicators for different protein annotations, for example whether an amino acid (residue) is part of a transmembrane region, a signal peptide, an active site, or any other binding site, or whether the residue is subject to posttranslational modifications, PathRatio (See Pei P, Zhang A: A Topological Measurement for Weighted Protein Interaction Network. CSB 2005, 268-278.), etc. Additional examples of the annotation channels can be found below in the Particular Implementations section and in the Claims.

The voxelized annotation channels are arranged voxel-wise such that the voxels can have a same annotation sequence like the voxelized reference allele and alternative allele sequences (e.g., annotation channels 2002, 2004, 2006), or the voxels can have respective annotation sequences like the voxelized per-voxel evolutionary profiles 1700 (e.g., annotation channels 2012, 2014, 2016 (as indicated by different colors)).

The annotation channels are voxelized, tensorized, concatenated, and processed for pathogenicity determination similar to the pan-amino acid conservation frequencies discussed with respect to FIGS. 12 to 18.

Structural Confidence Channels

The technology disclosed can also concatenate various voxelized structural confidence channels with the distance channel tensor 700. Some examples of the structural confidence channels include GMQE score (provided by SwissModel); B-factor; temperature factor column of homology models (indicates how well a residue satisfies (physical) constraints in the protein structure); normalized number of aligning template proteins for the residue nearest to the center of a voxel (alignments provided by HHpred, e.g., a voxel is nearest to a residue at which 3 of 6 template structures align, signifying that the feature has value 3/6=0.5); minimum, maximum, and mean TM-scores; and predicted TM-scores of the template protein structures that align to the residue that is nearest to a voxel (continuing the example above, assume the 3 template structures have TM-scores 0.5, 0.5, and 1.0; then the minimum is 0.5, the mean is ⅔, and the maximum is 1.0). The TM-scores can be provided per protein template by HHpred. Additional examples of the structural confidence channels can be found below in the Particular Implementations section and in the Claims.
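
The template-based features described above can be sketched as follows. The function name, the dictionary keys, the fallback values used when no template aligns, and the TM-scores in the usage example are hypothetical and chosen only for illustration.

```python
def template_confidence_features(aligned_tm_scores, total_templates):
    """Summarize template support for the residue nearest to a voxel.

    aligned_tm_scores holds the TM-scores of the templates that align at the
    residue; total_templates is the number of templates considered overall.
    """
    if total_templates <= 0:
        raise ValueError("total_templates must be positive")
    if not aligned_tm_scores:
        # Assumed fallback when no template aligns at the residue.
        return {"coverage": 0.0, "tm_min": 0.0, "tm_mean": 0.0, "tm_max": 0.0}
    return {
        "coverage": len(aligned_tm_scores) / total_templates,
        "tm_min": min(aligned_tm_scores),
        "tm_mean": sum(aligned_tm_scores) / len(aligned_tm_scores),
        "tm_max": max(aligned_tm_scores),
    }


# Three of six templates align; the TM-scores here are made-up values.
features = template_confidence_features([0.25, 0.5, 0.75], total_templates=6)
print(features)  # {'coverage': 0.5, 'tm_min': 0.25, 'tm_mean': 0.5, 'tm_max': 0.75}
```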

The voxelized structural confidence channels are arranged voxel-wise such that the voxels can have a same structural confidence sequence like the voxelized reference allele and alternative allele sequences, or the voxels can have respective structural confidence sequences like the voxelized per-voxel evolutionary profiles 1700.

The structural confidence channels are voxelized, tensorized, concatenated, and processed for pathogenicity determination similar to the pan-amino acid conservation frequencies discussed with respect to FIGS. 12 to 18.

Pathogenicity Classifier

FIG. 21 illustrates different combinations and permutations of input channels that can be provided as inputs 2102 to a pathogenicity classifier 2108 for a pathogenicity determination 2106 of a target variant. One of the inputs 2102 can be distance channels 2104 generated by a distance channels generator 2272. FIG. 22 shows different methods of calculating the distance channels 2104. In one implementation, the distance channels 2104 are generated based on distances 2202 between voxel centers and atoms across a plurality of atomic elements irrespective of amino acids. In some implementations, the distances 2202 are normalized by a maximum scan radius to generate normalized distances 2202a. In another implementation, the distance channels 2104 are generated based on distances 2212 between voxel centers and alpha-carbon atoms on an amino acid-basis. In some implementations, the distances 2212 are normalized by the maximum scan radius to generate normalized distances 2212a. In yet another implementation, the distance channels 2104 are generated based on distances 2222 between voxel centers and beta-carbon atoms on an amino acid-basis. In some implementations, the distances 2222 are normalized by the maximum scan radius to generate normalized distances 2222a. In yet another implementation, the distance channels 2104 are generated based on distances 2232 between voxel centers and side chain atoms on an amino acid-basis. In some implementations, the distances 2232 are normalized by the maximum scan radius to generate normalized distances 2232a. In yet another implementation, the distance channels 2104 are generated based on distances 2242 between voxel centers and backbone atoms on an amino acid-basis. In some implementations, the distances 2242 are normalized by the maximum scan radius to generate normalized distances 2242a. In yet another implementation, the distance channels 2104 are generated based on distances 2252 (one feature) between voxel centers and the respective nearest atoms irrespective of atom type and amino acid type. In yet another implementation, the distance channels 2104 are generated based on distances 2262 (one feature) between voxel centers and atoms from non-standard amino acids. In some implementations, the distances between the voxels and the atoms are calculated based on polar coordinates of the voxels and the atoms. The polar coordinates are parameterized by angles between the voxels and the atoms. In one implementation, this angle information is used to generate an angle channel for the voxels (i.e., independent of the distance channels). In some implementations, angles between a nearest atom and neighboring atoms (e.g., backbone atoms) can be used as features that are encoded with the voxels.
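
One possible reading of normalizing a distance by the maximum scan radius is sketched below. Clipping distances that exceed the radius to the radius itself, so that the normalized value stays within the range of zero to one, is an assumption of this example, as are the function and parameter names.

```python
import math


def normalized_distance(voxel_center, atom_xyz, max_scan_radius):
    """Return the voxel-center-to-atom distance divided by the maximum scan radius."""
    if max_scan_radius <= 0:
        raise ValueError("max_scan_radius must be positive")
    distance = math.dist(voxel_center, atom_xyz)
    # Assumed behavior: atoms beyond the scan radius are treated as being at the radius.
    return min(distance, max_scan_radius) / max_scan_radius


print(normalized_distance((0.0, 0.0, 0.0), (3.0, 4.0, 0.0), 10.0))   # 0.5
print(normalized_distance((0.0, 0.0, 0.0), (30.0, 0.0, 0.0), 10.0))  # 1.0
```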

Another one of the inputs 2102 can be a feature 2114 indicating missing atoms within a specified radius.

Another one of the inputs 2102 can be one-hot encoding 2124 of the reference amino acid. Another one of the inputs 2102 can be one-hot encoding 2134 of the variant/alternative amino acid.

Another one of the inputs 2102 can be evolutionary channels 2144 generated by an evolutionary profiles generator 2372, shown in FIG. 23. In one implementation, the evolutionary channels 2144 can be generated based on pan-amino acid conservation frequencies 2302. In another implementation, the evolutionary channels 2144 can be generated based on per-amino acid conservation frequencies 2312.

Another one of the inputs 2102 can be a feature 2154 indicating missing residue or missing evolutionary profile.

Another one of the inputs 2102 can be annotations channels 2164 generated by an annotations generator 2472, shown in FIG. 24. In one implementation, the annotations channels 2164 can be generated based on molecular processing annotations 2402. In another implementation, the annotations channels 2164 can be generated based on regions annotations 2412. In yet another implementation, the annotations channels 2164 can be generated based on sites annotations 2422. In yet another implementation, the annotations channels 2164 can be generated based on amino acid modifications annotations 2432. In yet another implementation, the annotations channels 2164 can be generated based on secondary structure annotations 2442. In yet another implementation, the annotations channels 2164 can be generated based on experimental information annotations 2452.

Another one of the inputs 2102 can be structure confidence channels 2174 generated by a structure confidence generator 2572, shown in FIG. 25. In one implementation, the structure confidence channels 2174 can be generated based on global model quality estimations (GMQEs) 2502. In another implementation, the structure confidence channels 2174 can be generated based on qualitative model energy analysis (QMEAN) scores 2512. In yet another implementation, the structure confidence channels 2174 can be generated based on temperature factors 2522. In yet another implementation, the structure confidence channels 2174 can be generated based on template modeling scores 2542. Examples of the template modeling scores 2542 include minimum template modeling scores 2542a, mean template modeling scores 2542b, and maximum template modeling scores 2542c.

A person skilled in the art will appreciate that any permutation and combination of the input channels can be concatenated into an input for processing through the pathogenicity classifier 2108 for the pathogenicity determination 2106 of the target variant. In some implementations, only a subset of the input channels may be concatenated. The input channels can be concatenated in any order. In one implementation, the input channels can be concatenated into a single tensor by a tensor generator (input encoder) 2110. This single tensor can then be provided as input to the pathogenicity classifier 2108 for the pathogenicity determination 2106 of the target variant.

In one implementation, the pathogenicity classifier 2108 uses convolutional neural networks (CNNs) with a plurality of convolution layers. In another implementation, the pathogenicity classifier 2108 uses recurrent neural networks (RNNs) such as long short-term memory networks (LSTMs), bi-directional LSTMs (Bi-LSTMs), and gated recurrent units (GRUs). In yet another implementation, the pathogenicity classifier 2108 uses both the CNNs and the RNNs. In yet another implementation, the pathogenicity classifier 2108 uses graph-convolutional neural networks that model dependencies in graph-structured data. In yet another implementation, the pathogenicity classifier 2108 uses variational autoencoders (VAEs). In yet another implementation, the pathogenicity classifier 2108 uses generative adversarial networks (GANs). In yet another implementation, the pathogenicity classifier 2108 can also be a language model based, for example, on self-attention such as the one implemented by Transformers and BERTs.

In yet other implementations, the pathogenicity classifier 2108 can use 1D convolutions, 2D convolutions, 3D convolutions, 4D convolutions, 5D convolutions, dilated or atrous convolutions, transpose convolutions, depthwise separable convolutions, pointwise convolutions, 1×1 convolutions, group convolutions, flattened convolutions, spatial and cross-channel convolutions, shuffled grouped convolutions, spatial separable convolutions, and deconvolutions. It can use one or more loss functions such as logistic regression/log loss, multi-class cross-entropy/softmax loss, binary cross-entropy loss, mean-squared error loss, L1 loss, L2 loss, smooth L1 loss, and Huber loss. It can use any parallelism, efficiency, and compression schemes such as TFRecords, compressed encoding (e.g., PNG), sharding, parallel calls for map transformation, batching, prefetching, model parallelism, data parallelism, and synchronous/asynchronous stochastic gradient descent (SGD). It can include upsampling layers, downsampling layers, recurrent connections, gates and gated memory units (like an LSTM or GRU), residual blocks, residual connections, highway connections, skip connections, peephole connections, activation functions (e.g., non-linear transformation functions like rectified linear unit (ReLU), leaky ReLU, exponential linear unit (ELU), sigmoid, and hyperbolic tangent (tanh)), batch normalization layers, regularization layers, dropout, pooling layers (e.g., max or average pooling), global average pooling layers, attention mechanisms, and Gaussian error linear units.

The pathogenicity classifier 2108 is trained using backpropagation-based gradient update techniques. Example gradient descent techniques that can be used for training the pathogenicity classifier 2108 include stochastic gradient descent, batch gradient descent, and mini-batch gradient descent. Some examples of gradient descent optimization algorithms that can be used to train the pathogenicity classifier 2108 are Momentum, Nesterov accelerated gradient, Adagrad, Adadelta, RMSprop, Adam, AdaMax, Nadam, and AMSGrad. In other implementations, the pathogenicity classifier 2108 can be trained by unsupervised learning, semi-supervised learning, self-learning, reinforcement learning, multitask learning, multimodal learning, transfer learning, knowledge distillation, and so on.

FIG. 26 shows an example processing architecture 2600 of the pathogenicity classifier 2108, in accordance with one implementation of the technology disclosed. The processing architecture 2600 includes a cascade of processing modules 2606, 2610, 2614, 2618, 2622, 2626, 2630, 2634, 2638, and 2642 each of which can include 1D convolutions (1×1×1 CONV), 3D convolutions (3×3×3 CONV), ReLU non-linearity, and batch normalization (BN). Other examples of the processing modules include fully-connected (FC) layers, a dropout layer, a flattening layer, and a final softmax layer that produces exponentially normalized scores for the target variant belonging to a benign class and a pathogenic class. In FIG. 26, “64” denotes a number of convolution filters applied by a particular processing module. In FIG. 26, the size of an input voxel 2602 is 15×15×15×8. FIG. 26 also shows respective volumetric dimensionalities of the intermediate inputs 2604, 2608, 2612, 2616, 2620, 2624, 2628, 2632, 2636, and 2640 generated by the processing architecture 2600.

FIG. 27 shows an example processing architecture 2700 of the pathogenicity classifier 2108, in accordance with one implementation of the technology disclosed. The processing architecture 2700 includes a cascade of processing modules 2708, 2714, 2720, 2726, 2732, 2738, 2744, 2750, 2756, 2762, 2768, 2774, and 2780 such as 1D convolutions (CONV 1D), 3D convolutions (CONV 3D), ReLU non-linearity, and batch normalization (BN). Other examples of the processing modules include fully-connected (dense) layers, a dropout layer, a flattening layer, and a final softmax layer that produces exponentially normalized scores for the target variant belonging to a benign class and a pathogenic class. In FIG. 27, “64” and “32” denote a number of convolution filters applied by a particular processing module. In FIG. 27, the size of an input voxel 2704 supplied by an input layer 2702 is 7×7×7×108. FIG. 27 also shows respective volumetric dimensionalities of the intermediate inputs 2710, 2716, 2722, 2728, 2734, 2740, 2746, 2752, 2758, 2764, 2770, 2776, and 2782 and the resulting intermediate outputs 2706, 2712, 2718, 2724, 2730, 2736, 2742, 2748, 2754, 2760, 2766, 2772, 2778, and 2784 generated by the processing architecture 2700.
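
The exponential normalization performed by a final softmax layer over a benign class and a pathogenic class can be illustrated with the short sketch below. The function name and the input values are hypothetical, and the sketch covers only the final normalization step, not the convolutional layers that precede it.

```python
import math


def softmax(logits):
    """Exponentially normalize raw scores so that they are positive and sum to one."""
    shift = max(logits)  # subtracting the maximum improves numerical stability
    exponentials = [math.exp(value - shift) for value in logits]
    total = sum(exponentials)
    return [value / total for value in exponentials]


# Equal raw scores for the two classes yield equal normalized scores.
benign_score, pathogenic_score = softmax([0.0, 0.0])
print(benign_score, pathogenic_score)  # 0.5 0.5
```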

A person skilled in the art will appreciate that other current and future artificial intelligence, machine learning, and deep learning models, datasets, and training techniques can be incorporated in the disclosed variant pathogenicity classifier without deviating from the spirit of the technology disclosed.

Performance Results as Objective Indicia of Inventiveness and Non-Obviousness

The variant pathogenicity classifier disclosed herein makes pathogenicity predictions based on 3D protein structures and is referred to as “PrimateAI 3D.” “PrimateAI” is a commonly owned and previously disclosed variant pathogenicity classifier that makes pathogenicity predictions based on protein sequences. Additional details about PrimateAI can be found in commonly owned U.S. patent application Ser. Nos. 16/160,903; 16/160,986; 16/160,968; and 16/407,149 and in Sundaram, L. et al. Predicting the clinical impact of human mutation with deep neural networks. Nat. Genet. 50, 1161-1170 (2018).

FIGS. 28, 29, 30, and 31 use PrimateAI as a benchmark model to demonstrate PrimateAI 3D's classification superiority over PrimateAI. The performance results in FIGS. 28, 29, 30, and 31 are generated on the classification task of accurately distinguishing benign variants from pathogenic variants across a plurality of validation sets. PrimateAI 3D is trained on training sets that are different from the plurality of validation sets. PrimateAI 3D is trained on common human variants and variants from primates, which are used as a benign dataset, while simulated variants based on trinucleotide context are used as an unlabeled or pseudo-pathogenic dataset.

New developmental delay disorder (new DDD) is one example of a validation set used to compare the classification accuracy of PrimateAI 3D against PrimateAI. The new DDD validation set labels variants from individuals with DDD as pathogenic and labels the same variants from healthy relatives of the individuals with the DDD as benign. A similar labelling scheme is used with an autism spectrum disorder (ASD) validation set shown in FIG. 31.

BRCA1 is another example of a validation set used to compare the classification accuracy of PrimateAI 3D against PrimateAI. The BRCA1 validation set labels synthetically generated reference amino acid sequences simulating proteins of the BRCA1 gene as benign variants and labels synthetically altered allele amino acid sequences simulating proteins of the BRCA1 gene as pathogenic variants. A similar labelling scheme is used with different validation sets of the TP53 gene, TP53S3 gene and its variants, and other genes and their variants shown in FIG. 31.

FIG. 28 identifies performance of the benchmark PrimateAI model with blue horizontal bars and performance of the disclosed PrimateAI 3D model with orange horizontal bars. Green horizontal bars depict pathogenicity predictions derived by combining respective pathogenicity predictions of the disclosed PrimateAI 3D model and the benchmark PrimateAI model. In the legend, “ens10” denotes an ensemble of ten PrimateAI 3D models, each trained with a different seed training dataset and randomly initialized with different weights and biases. Also, “7×7×7×2” depicts the size of the voxel grid used to encode the input channels during the training of the ensemble of ten PrimateAI 3D models. For a given variant, the ensemble of ten PrimateAI 3D models respectively generates ten pathogenicity predictions, which are subsequently combined (e.g., by averaging) to generate a final pathogenicity prediction for the given variant. This logic applies analogously to ensembles of different group sizes.
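
The ensemble combination described above can be illustrated with a minimal Python sketch. It assumes that each model emits a pathogenicity score between 0 and 1 and uses plain averaging, which the text names only as one example of combining predictions. The function name and the score values are hypothetical.

```python
from statistics import fmean


def ensemble_prediction(scores):
    """Average the per-model pathogenicity scores for a single variant."""
    if not scores:
        raise ValueError("at least one model score is required")
    return fmean(scores)


# Hypothetical scores produced by an ensemble of ten models for one variant.
ens10_scores = [0.81, 0.77, 0.85, 0.79, 0.83, 0.80, 0.78, 0.84, 0.82, 0.76]
final_score = ensemble_prediction(ens10_scores)
```

The same function applies to ensembles of other group sizes because it depends only on the number of scores supplied.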

Also, in FIG. 28, the y-axis has the different validation sets and the x-axis has p-values. Greater p-values, i.e., longer horizontal bars, denote greater accuracy in differentiating benign variants from pathogenic variants. As demonstrated by the p-values in FIG. 28, PrimateAI 3D outperforms PrimateAI across most of the validation sets (the only exception being the tp53s3_A549 validation set). That is, the orange horizontal bars for PrimateAI 3D are longer than the blue horizontal bars for PrimateAI for every validation set but that one.

Also, in FIG. 28, a “mean” category along the y-axis calculates the mean of the p-values determined for each of the validation sets. In the mean category as well, PrimateAI 3D outperforms PrimateAI.

In FIG. 29, PrimateAI is represented by blue horizontal bars, an ensemble of twenty PrimateAI 3D models trained with a voxel grid of size 3×3×3 is represented by red horizontal bars, an ensemble of ten PrimateAI 3D models trained with a voxel grid of size 7×7×7×2 is represented by purple horizontal bars, an ensemble of twenty PrimateAI 3D models trained with a voxel grid of size 7×7×7×2 is represented by brown horizontal bars, and an ensemble of twenty PrimateAI 3D models trained with a voxel grid of size 17×17×17×2 is represented by pink horizontal bars.

Also, in FIG. 29, the y-axis has the different validation sets and the x-axis has p-values. As before, greater p-values, i.e., longer horizontal bars, denote greater accuracy in differentiating benign variants from pathogenic variants. As demonstrated by the p-values in FIG. 29, different configurations of PrimateAI 3D outperform PrimateAI across most of the validation sets. That is, the red, purple, brown, and pink horizontal bars for PrimateAI 3D are mostly longer than the blue horizontal bars for PrimateAI.

Also, in FIG. 29, a “mean” category along the y-axis calculates the mean of the p-values determined for each of the validation sets. In the mean category as well, the different configurations of PrimateAI 3D outperform PrimateAI.

In FIG. 30, the red vertical bars represent PrimateAI, and the cyan vertical bars represent PrimateAI 3D. In FIG. 30, the y-axis has p-values, and the x-axis has the different validation sets. In FIG. 30, without exception, PrimateAI 3D outperforms PrimateAI across all of the validation sets. That is, the cyan vertical bars for PrimateAI 3D are always longer than the red vertical bars for PrimateAI.

FIG. 31 identifies performance of the benchmark PrimateAI model with blue vertical bars and performance of the disclosed PrimateAI 3D model with orange vertical bars. Green vertical bars depict pathogenicity predictions derived by combining respective pathogenicity predictions of the disclosed PrimateAI 3D model and the benchmark PrimateAI model. In FIG. 31, the y-axis has p-values, and the x-axis has the different validation sets.

As demonstrated by the p-values in FIG. 31, PrimateAI 3D outperforms PrimateAI across most of the validation sets (the only exception being the tp53s3_A549p53NULL_Nutlin-3 validation set). That is, the orange vertical bars for PrimateAI 3D are longer than the blue vertical bars for PrimateAI for every validation set but that one.

Also, in FIG. 31, a separate “mean” chart calculates the mean of the p-values determined for each of the validation sets. In the mean chart as well, PrimateAI 3D outperforms PrimateAI.

The mean statistics may be biased by outliers. To address this, a separate “method ranks” chart is also depicted in FIG. 31. Higher rank denotes poorer classification accuracy. In the method ranks chart as well, PrimateAI 3D outperforms PrimateAI by having more counts of lower ranks 1 and 2 versus PrimateAI having all 3s.

In FIGS. 28 to 31, it is also evident that combining PrimateAI 3D with PrimateAI produces superior classification accuracy. That is, a protein can be fed as an amino acid sequence to PrimateAI to generate a first output, and the same protein can be fed as a 3D, voxelized protein structure to PrimateAI 3D to generate a second output, and the first and second outputs can be combined or analyzed in aggregate to produce a final pathogenicity prediction for a variant experienced by the protein.

Efficient Voxelization

FIG. 32 is a flowchart illustrating an efficient voxelization process 3200 that efficiently identifies nearest atoms on a voxel-by-voxel basis.

The discussion now revisits the distance channels. As discussed above, the reference amino acid sequence 202 can contain different types of atoms, such as alpha-carbon atoms, beta-carbon atoms, oxygen atoms, nitrogen atoms, hydrogen atoms, and so on. Accordingly, as discussed above, the distance channels can be arranged by nearest alpha-carbon atoms, nearest beta-carbon atoms, nearest oxygen atoms, nearest nitrogen atoms, nearest hydrogen atoms, and so on. For example, in FIG. 6, each of the nine voxels 514 has twenty-one amino acid-wise distance channels for nearest alpha-carbon atoms. FIG. 6 can be further expanded for each of the nine voxels 514 to also have twenty-one amino acid-wise distance channels for nearest beta-carbon atoms, and for each of the nine voxels 514 to also have a nearest generic atom distance channel for a nearest atom irrespective of the type of the atom and the type of the amino acid. This way, each of the nine voxels 514 can have forty-three distance channels (21+21+1).

The discussion now turns to the number of distance calculations required to identify the nearest atoms on a voxel-by-voxel basis for inclusion in the distance channels. Consider the example in FIG. 3 that depicts a total of eight hundred and twenty-eight alpha-carbon atoms distributed across the twenty-one amino acid categories. To calculate the amino acid-wise distance channels 602-642 in FIG. 6, i.e., to determine the one hundred and eighty-nine distance values, distances are measured from each of the nine voxels 514 to each of the eight hundred and twenty-eight alpha-carbon atoms, resulting in 9*828=7,452 distance calculations. In the 3D case of twenty-seven voxels, this results in 828*27=22,356 distance calculations. When the eight hundred and twenty-eight beta-carbon atoms are also included, this number increases to 27*1656=44,712 distance calculations.

This means that the runtime complexity of identifying the nearest atoms on a voxel-by-voxel basis for a single protein voxelization is O(#atoms*#voxels), as illustrated by FIG. 35A. Furthermore, the runtime complexity for a single protein voxelization increases to O(#atoms*#voxels*#attributes) when the distance channels are calculated across a variety of attributes (e.g., different features or channels per voxel like annotation channels and structural confidence channels).
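
The O(#atoms*#voxels) baseline can be made concrete with a short Python sketch of a brute-force search that measures the distance from every voxel center to every atom and counts the measurements. The 2D coordinates stand in for 3D coordinates, as in FIG. 33, and the atom positions, the grid layout, and the function name are hypothetical.

```python
import math


def brute_force_nearest(voxel_centers, atoms):
    """Return the nearest atom index per voxel and the number of distances measured."""
    nearest = []
    n_distances = 0
    for center in voxel_centers:                 # loop over #voxels
        best_atom, best_dist = None, math.inf
        for atom_idx, atom in enumerate(atoms):  # loop over #atoms
            dist = math.dist(center, atom)
            n_distances += 1
            if dist < best_dist:
                best_atom, best_dist = atom_idx, dist
        nearest.append(best_atom)
    return nearest, n_distances


# Centers of a 3x3 voxel grid with a step size of 1 (assumed origin at 0).
centers = [(i + 0.5, j + 0.5) for i in range(3) for j in range(3)]
# Hypothetical atom coordinates.
atoms = [(1.7, 2.1), (0.4, 0.3), (1.2, 1.4), (2.6, 0.8)]
nearest, n_distances = brute_force_nearest(centers, atoms)
```

With nine voxels and four atoms the sketch performs 9*4=36 distance calculations; substituting the 828 alpha-carbon atoms of the example gives the 9*828=7,452 figure stated above.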

Consequently, the distance calculations can become the most compute-consuming part of the voxelization process, taking valuable compute resources away from critical runtime tasks like model training and model inference. Consider, for example, the case of model training with a training dataset of 7,000 proteins. Generating distance channels for a plurality of voxels across a plurality of amino acids, atoms, and attributes can involve more than 100 voxelizations per protein, resulting in about 800,000 voxelizations in a single training iteration (epoch). A training run of 20-40 epochs, with rotation of atomic coordinates in each epoch, can result in as many as 32 million voxelizations.

In addition to the high compute cost, the size of the data for 32 million voxelizations is too big to fit in main memory (e.g., >20 TB for a 15×15×15 voxel grid). Considering repeated training runs for parameter optimization and ensemble learning, the memory footprint of the voxelization process gets too big to be stored on disk, making the voxelization process a part of the model training and not a precomputation step.

The technology disclosed provides an efficient voxelization process that achieves up to ˜100× speedup over the runtime complexity of O(#atoms*#voxels). The disclosed efficient voxelization process reduces the runtime complexity for a single protein voxelization to O(#atoms). In the case of different features or channels per voxel, the disclosed efficient voxelization process reduces the runtime complexity for a single protein voxelization to O(#atoms*#attributes). As a result, the voxelization process becomes as fast as model training, shifting the computational bottleneck from voxelization back to computing neural network weights on processors such as GPUs, ASICs, TPUs, FPGAs, CGRAs, etc.

In some implementations of the disclosed efficient voxelization process involving large voxel grids, the runtime complexity for a single protein voxelization is O(#atoms+#voxels) and O(#atoms*#attributes+#voxels) for the case of different features or channels per voxel. The “+#voxels” complexity is observed when the number of atoms is minuscule compared to the number of voxels, for example, when there is one atom in a 100×100×100 voxel grid (i.e., one million voxels per atom). In such a scenario, the runtime is dominated by the overhead of the huge number of voxels, for example, for allocating the memory for one million voxels, initializing one million voxels to zero, etc.

The discussion now turns to details of the disclosed efficient voxelization process. FIGS. 32A, 32B, 33, 34, and 35B are discussed in tandem.

Starting with FIG. 32A, at step 3202, each atom (e.g., each of the 828 alpha-carbon atoms and each of the 828 beta-carbon atoms) is associated with a voxel that contains the atom (e.g., one of the nine voxels 514). The term “contains” refers to the 3D atomic coordinates of the atom being located in the voxel. The voxel that contains the atom is also referred to herein as “the atom-containing voxel.”

FIGS. 32B and 33 describe how a voxel that contains a particular atom is selected. FIG. 33 uses 2D atomic coordinates as representative of 3D atomic coordinates. Note that the voxel grid 522 is regularly spaced with each of the voxels 514 having a same step size (e.g., 1 angstrom (Å) or 2 Å).

Also, in FIG. 33, the voxel grid 522 has magenta indices [0, 1, 2] along a first dimension (e.g., x-axis) and cyan indices [0, 1, 2] along a second dimension (e.g., y-axis). Also, in FIG. 33, the respective voxels 514 in the voxel 512 are identified by green voxel indices [Voxel 0, Voxel 1, . . . , Voxel 8] and by black voxel center indices [(1, 1), (1, 2), . . . , (3, 3)].

Also, in FIG. 33, center coordinates of the voxel centers along the first dimension, i.e., first dimension voxel coordinates, are identified in orange. Also, in FIG. 33, center coordinates of the voxel centers along the second dimension, i.e., second dimension voxel coordinates, are identified in red.

First, at step 3202 a (Step 1 in FIG. 33), 3D atomic coordinates (1.7456, 2.14323) of the particular atom are quantized to generate quantized 3D atomic coordinates (1.7, 2.1). The quantization can be achieved by rounding or truncation of bits.

Then, at step 3202 b (Step 2 in FIG. 33), voxel coordinates (or voxel centers or voxel center coordinates) of the voxels 514 are assigned to the quantized 3D atomic coordinates on a dimension-basis. For the first dimension, the quantized atomic coordinate 1.7 is assigned to Voxel 1 because it covers first dimension voxel coordinates ranging from 1 to 2 and is centered at 1.5 in the first dimension. Note that Voxel 1 has index 1 along the first dimension, in contrast to having index 0 along the second dimension.

For the second dimension, starting from Voxel 1, the voxel grid 522 is traversed along the second dimension. This results in the quantized atomic coordinate 2.1 being assigned to Voxel 7 because it covers second dimension voxel coordinates ranging from 2 to 3 and is centered at 2.5 in the second dimension. Note that Voxel 7 has index 2 along the second dimension, in contrast to having index 1 along the first dimension.

Then, at step 3202 c (Step 3 in FIG. 33), dimension indices corresponding to the assigned voxel coordinates are selected. That is, for Voxel 1, index 1 is selected along the first dimension, and, for Voxel 7, index 2 is selected along the second dimension. A person skilled in the art will appreciate that the above steps can be analogously executed for a third dimension to select a dimension index along the third dimension.

Then, at step 3202 d (Step 4 in FIG. 33), an accumulated sum is generated based on position-wise weighting the selected dimension indices by powers of a radix. The general idea behind positional numbering systems is that a numeric value is represented through increasing powers of the radix (or base), for example, binary is base two, ternary is base three, octal is base eight, and hexadecimal is base sixteen. This is often referred to as a weighted numbering system because each position is weighted by a power of the radix. The set of valid numerals for a positional numbering system is equal in size to the radix of that system. For example, there are ten digits in the decimal system, zero through nine, and three digits in the ternary system, zero, one, and two. The largest valid digit in a radix system is one smaller than the radix (so eight is not a valid numeral in any radix system smaller than nine). Any decimal integer can be expressed exactly in any other integral base system, and vice-versa.

Returning to the example in FIG. 33, the selected dimension indices 1 and 2 are converted to a single integer by position-wise multiplying them with respective powers of base three and summing the results of the position-wise multiplications. Base three is selected here because the 3D atomic coordinates have three dimensions (although FIG. 33 shows only 2D atomic coordinates along two dimensions for simplicity's sake).

Since index 2 is positioned at the rightmost position (i.e., the least significant digit), it is multiplied by three to the power of zero to yield two. Since index 1 is positioned at the second rightmost position (i.e., the second least significant digit), it is multiplied by three to the power of one to yield three. This results in the accumulated sum being five.

Then, at step 3202 e (Step 5 in FIG. 33), based on the accumulated sum, a voxel index of the voxel containing the particular atom is selected. That is, the accumulated sum is interpreted as the voxel index of the voxel containing the particular atom.
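
Steps 1 through 5 can be traced with a minimal Python sketch that reproduces the worked example of FIG. 33. The sketch assumes a grid origin at zero, a step size of 1, quantization by rounding to one decimal place, and a radix of three applied to a grid with three voxels per dimension; the function names are hypothetical and are not taken from the disclosure.

```python
import math


def quantize(coords, decimals=1):
    """Step 1: quantize the atomic coordinates by rounding."""
    return tuple(round(c, decimals) for c in coords)


def dimension_indices(quantized, origin=0.0, step=1.0):
    """Steps 2 and 3: select, per dimension, the index of the voxel covering the coordinate."""
    return tuple(math.floor((q - origin) / step) for q in quantized)


def accumulate(indices, radix=3):
    """Step 4: weight the indices position-wise by powers of the radix and sum them."""
    total = 0
    for idx in indices:
        if not 0 <= idx < radix:
            raise ValueError("index is not a valid digit for this radix")
        total = total * radix + idx
    return total


quantized = quantize((1.7456, 2.14323))   # (1.7, 2.1)
indices = dimension_indices(quantized)    # (1, 2)
voxel_index = accumulate(indices)         # 1*3**1 + 2*3**0 = 5 (Step 5)
```

No distance is measured at any point; the atom-containing voxel follows from arithmetic on the atom's own coordinates, so the cost per atom does not depend on the number of voxels.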

At step 3212, after each atom is associated with the atom-containing voxel, each atom is further associated with one or more voxels that are in a neighborhood of the atom-containing voxel, also referred to herein as “neighborhood voxels.” The neighborhood voxels can be selected based on being within a predefined radius of the atom-containing voxel (e.g., 5 angstrom (Å)). In other implementations, the neighborhood voxels can be selected based on being contiguously adjacent to the atom-containing voxel (e.g., top, bottom, right, left adjacent voxels). The resulting association that associates each atom with the atom-containing voxel and the neighborhood voxels is encoded in an atom-to-voxels mapping 3402, also referred to herein as element-to-cells mapping. In one example, a first alpha-carbon atom is associated with a first subset of voxels 3404 that includes an atom-containing voxel and neighborhood voxels for the first alpha-carbon atom. In another example, a second alpha-carbon atom is associated with a second subset of voxels 3406 that includes an atom-containing voxel and neighborhood voxels for the second alpha-carbon atom.

Note that no distance calculations are made to determine the atom-containing voxel and the neighborhood voxels. The atom-containing voxel is selected by virtue of the spatial arrangement of the voxels that allows assignment of quantized 3D atomic coordinates to corresponding regularly spaced voxel centers in the voxel grid (without using any distance calculations). Also, the neighborhood voxels are selected by virtue of being spatially contiguous to the atom-containing voxel in the voxel grid (again without using any distance calculations).

At step 3222, each voxel is mapped to atoms to which it was associated at steps 3202 and 3212. In one implementation, this mapping is encoded in a voxel-to-atoms mapping 3412, which is generated based on the atom-to-voxels mapping 3402 (e.g., by applying a voxel-based sorting key on the atom-to-voxels mapping 3402). The voxel-to-atoms mapping 3412 is also referred to herein as “cell-to-elements mapping.” In one example, a first voxel is mapped to a first subset of alpha-carbon atoms 3414 that includes alpha-carbon atoms associated with the first voxel at steps 3202 and 3212. In another example, a second voxel is mapped to a second subset of alpha-carbon atoms 3416 that includes alpha-carbon atoms associated with the second voxel at steps 3202 and 3212.
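
The two mappings can be sketched in Python as follows. The sketch assumes a 2D 3×3 grid, uses the contiguously adjacent (top, bottom, left, right) option for the neighborhood voxels, and inverts the atom-to-voxels mapping by sorting (voxel, atom) pairs on the voxel key. The atom labels and function names are hypothetical.

```python
from itertools import groupby


def neighbors(voxel, shape):
    """Voxels sharing a face with `voxel` that lie inside the grid."""
    result = []
    for dim in range(len(shape)):
        for delta in (-1, 1):
            candidate = list(voxel)
            candidate[dim] += delta
            if 0 <= candidate[dim] < shape[dim]:
                result.append(tuple(candidate))
    return result


def atom_to_voxels(containing, shape):
    """Associate each atom with its atom-containing voxel and the neighborhood voxels."""
    return {atom: [voxel] + neighbors(voxel, shape) for atom, voxel in containing.items()}


def voxel_to_atoms(a2v):
    """Invert the mapping by sorting (voxel, atom) pairs on the voxel key."""
    pairs = sorted((voxel, atom) for atom, voxels in a2v.items() for voxel in voxels)
    return {voxel: [atom for _, atom in group]
            for voxel, group in groupby(pairs, key=lambda pair: pair[0])}


# Hypothetical atoms with their atom-containing voxels (per-dimension indices).
containing = {"CA_1": (1, 2), "CA_2": (0, 0), "CA_3": (1, 1)}
a2v = atom_to_voxels(containing, shape=(3, 3))
v2a = voxel_to_atoms(a2v)
```

In this toy input, voxel (1, 1) ends up mapped to both CA_1 and CA_3, while voxel (0, 0) is mapped to CA_2 alone. The total number of (voxel, atom) pairs is bounded by the number of atoms times the neighborhood size plus one.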

At step 3232, for each voxel, distances are calculated between the voxel and atoms mapped to the voxel at step 3222. Step 3232 has a runtime complexity of O(#atoms) because distance to a particular atom is measured only once from a respective voxel to which the particular atom is uniquely mapped in the voxel-to-atoms mapping 3412. This is true when no neighboring voxels are considered. Without neighbors, the constant factor that is implied in the big-O notation is 1. With neighbors, the constant factor is equal to the number of neighbors + 1; since the number of neighbors is constant for each voxel, the runtime complexity of O(#atoms) remains true. In contrast, in FIG. 35A, distances to a particular atom are redundantly measured as many times as the number of voxels (e.g., 27 distances for a particular atom due to 27 voxels).

In FIG. 35B, based on the voxel-to-atoms mapping 3412, each voxel is mapped to a respective subset of the 828 atoms (not including distance calculations to neighborhood voxels), as illustrated by respective ovals for respective voxels. The respective subsets are largely non-overlapping, with some exceptions. Insignificant overlap exists due to some instances when multiple atoms are mapped to a same voxel, as indicated in FIG. 35B by the prime symbol “′” and the yellow overlap between the ovals. This minimal overlap has an additive effect on the runtime complexity of O(#atoms) and not a multiplicative effect. This overlap is a result of considering neighboring voxels, after determining the voxel that contains the atom. Without neighboring voxels, there can be no overlap, because an atom is only associated with one voxel. Considering neighbors, however, each neighbor could potentially be associated with the same atom (as long as there is no other atom of the same amino acid that is closer).

At step 3242, for each voxel, based on the distances calculated at step 3232, a nearest atom to the voxel is identified. In one implementation, this identification is encoded in a voxel-to-nearest atom mapping 3422, also referred to herein as “cell-to-nearest element mapping.” In one example, the first voxel is mapped to a second alpha-carbon atom as its nearest alpha-carbon atom 3424. In another example, the second voxel is mapped to a thirty-first alpha-carbon atom as its nearest alpha-carbon atom 3426.
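
Steps 3232 and 3242 can be sketched in Python as a single pass over a voxel-to-atoms mapping, measuring distances only to the atoms mapped to each voxel and keeping the closest one. The sketch assumes 2D coordinates, a grid origin at zero, and a step size of 1; the mapping contents, atom labels, and function names are hypothetical.

```python
import math


def voxel_center(voxel, step=1.0):
    """Center coordinates of a voxel given its per-dimension indices."""
    return tuple((idx + 0.5) * step for idx in voxel)


def nearest_atom_per_voxel(voxel_to_atoms, atom_coords, step=1.0):
    """Return {voxel: (nearest atom, distance)} and the number of distances measured."""
    nearest = {}
    n_distances = 0
    for voxel, atom_ids in voxel_to_atoms.items():
        center = voxel_center(voxel, step)
        best = None
        for atom_id in atom_ids:      # only the atoms mapped to this voxel
            dist = math.dist(center, atom_coords[atom_id])
            n_distances += 1
            if best is None or dist < best[1]:
                best = (atom_id, dist)
        nearest[voxel] = best
    return nearest, n_distances


# Hypothetical atom coordinates and a hypothetical voxel-to-atoms mapping.
atom_coords = {"CA_1": (1.7, 2.1), "CA_2": (0.4, 0.3), "CA_3": (1.2, 1.4)}
v2a = {(1, 2): ["CA_1", "CA_3"], (1, 1): ["CA_1", "CA_3"], (0, 0): ["CA_2"]}
nearest, n_distances = nearest_atom_per_voxel(v2a, atom_coords)
```

Here five distances are measured, one per (voxel, atom) pair in the mapping, rather than one per combination of voxel and atom.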

Furthermore, as the voxel-wise distances are calculated using the technique discussed above, the atom-type and amino acid-type categorization of the atoms and the corresponding distance values are stored to generate categorized distance channels.

Once the distances to nearest atoms are identified using the technique discussed above, these distances can be encoded in the distance channels for voxelization and subsequent processing by the pathogenicity classifier 2108.

Computer System

FIG. 36 shows an example computer system 3600 that can be used to implement the technology disclosed. Computer system 3600 includes at least one central processing unit (CPU) 3672 that communicates with a number of peripheral devices via bus subsystem 3655. These peripheral devices can include a storage subsystem 3610 including, for example, memory devices and a file storage subsystem 3636, user interface input devices 3638, user interface output devices 3676, and a network interface subsystem 3674. The input and output devices allow user interaction with computer system 3600. Network interface subsystem 3674 provides an interface to outside networks, including an interface to corresponding interface devices in other computer systems.

In one implementation, the pathogenicity classifier 2108 is communicably linked to the storage subsystem 3610 and the user interface input devices 3638.

User interface input devices 3638 can include a keyboard; pointing devices such as a mouse, trackball, touchpad, or graphics tablet; a scanner; a touch screen incorporated into the display; audio input devices such as voice recognition systems and microphones; and other types of input devices. In general, use of the term “input device” is intended to include all possible types of devices and ways to input information into computer system 3600.

User interface output devices 3676 can include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices. The display subsystem can include an LED display, a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image. The display subsystem can also provide a non-visual display such as audio output devices. In general, use of the term “output device” is intended to include all possible types of devices and ways to output information from computer system 3600 to the user or to another machine or computer system.

Storage subsystem 3610 stores programming and data constructs that provide the functionality of some or all of the modules and methods described herein. These software modules are generally executed by processors 3678.

Processors 3678 can be graphics processing units (GPUs), field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), and/or coarse-grained reconfigurable architectures (CGRAs). Processors 3678 can be hosted by a deep learning cloud platform such as Google Cloud Platform™, Xilinx™, and Cirrascale™. Examples of processors 3678 include Google's Tensor Processing Unit (TPU)™, rackmount solutions like GX4 Rackmount Series™, GX36 Rackmount Series™, NVIDIA DGX-1™, Microsoft's Stratix V FPGA™, Graphcore's Intelligent Processor Unit (IPU)™, Qualcomm's Zeroth Platform™ with Snapdragon processors™, NVIDIA's Volta™, NVIDIA's DRIVE PX™, NVIDIA's JETSON TX1/TX2 MODULE™, Intel's Nirvana™, Movidius VPU™, Fujitsu DPI™, ARM's DynamIQ™, IBM TrueNorth™, Lambda GPU Server with Tesla V100s™, and others.

Memory subsystem 3622 used in the storage subsystem 3610 can include a number of memories including a main random access memory (RAM) 3632 for storage of instructions and data during program execution and a read only memory (ROM) 3634 in which fixed instructions are stored. A file storage subsystem 3636 can provide persistent storage for program and data files, and can include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges. The modules implementing the functionality of certain implementations can be stored by file storage subsystem 3636 in the storage subsystem 3610, or in other machines accessible by the processor.

Bus subsystem 3655 provides a mechanism for letting the various components and subsystems of computer system 3600 communicate with each other as intended. Although bus subsystem 3655 is shown schematically as a single bus, alternative implementations of the bus subsystem can use multiple busses.

Computer system 3600 itself can be of varying types including a personal computer, a portable computer, a workstation, a computer terminal, a network computer, a television, a mainframe, a server farm, a widely-distributed set of loosely networked computers, or any other data processing system or user device. Due to the ever-changing nature of computers and networks, the description of computer system 3600 depicted in FIG. 36 is intended only as a specific example for purposes of illustrating the preferred implementations of the present invention. Many other configurations of computer system 3600 are possible having more or fewer components than the computer system depicted in FIG. 36.

Amino Acid Prediction

Protein language models trained with the masked language modeling objective are supervised to output the probability that an amino acid occurs at a position in a protein given the surrounding context. Proteins are linear polymers that fold into various specific conformations to function. The incredible variety of three-dimensional (3D) structures determined by the combination and order in which 20 amino acids thread the protein polymer chain (sequence of the protein) enables the sophisticated functionality of proteins responsible for most biological activities. Hence, obtaining the structures of proteins is of paramount importance in both understanding the fundamental biology of health and disease and developing therapeutic molecules. While protein structure is primarily determined by sophisticated experimental techniques, such as X-ray crystallography, NMR spectroscopy and, increasingly, cryo-electron microscopy, computational structure prediction from the genetically encoded amino acid sequence of a protein has been used as an alternative when experimental approaches are limited.

Computational methods have been used to predict the structure of proteins, to illustrate the mechanism of biological processes, and to determine the properties of proteins. Furthermore, all naturally occurring proteins are a result of an evolutionary process of random variants arising under various selective pressures. Through this process, nature has explored only a small subset of theoretically possible protein sequence space. Advances in machine learning, especially deep learning, are catalyzing a revolution in the paradigm of scientific research. Some deep learning-based approaches, especially in structure prediction, now outperform conventional methods, often in combination with higher-resolution physical modeling. Challenges remain in experimental validation, benchmarking, leveraging known physics and interpreting models, and extending to other biomolecules and contexts.

Protein sites are microenvironments within a protein structure, distinguished by their structural or functional role. A site can be defined by a three-dimensional location and a local neighborhood around this location in which the structure or function exists. Central to rational protein engineering is the understanding of how the structural arrangement of amino acids creates functional characteristics within protein sites. Determination of the structural and functional roles of individual amino acids within a protein provides information to help engineer and alter protein functions. Identifying functionally or structurally important amino acids allows focused engineering efforts such as site-directed mutagenesis for altering targeted protein functional properties.

In one implementation, the technology disclosed relates to predicting spatial tolerability of amino acid substitutes. In such an implementation, the technology disclosed includes a gapping logic and a substitution logic. The gapping logic is configured to remove, from a protein, a particular amino acid at a particular position, and create an amino acid vacancy at the particular position in the protein. The substitution logic is configured to process the protein with the amino acid vacancy, and score tolerability of substitute amino acids that are candidates for filling/fitting the amino acid vacancy. The substitution logic is further configured to score the tolerability of the substitute amino acids based at least in part on structural (or spatial) compatibility between the substitute amino acids and adjacent amino acids in a neighborhood of the amino acid vacancy (e.g., the right and left flanking amino acids). The substitution logic evaluates the extent to which an amino acid “fits” its surrounding protein environment and shows that mutations that disrupt strong amino acid preferences are more likely to be deleterious.

When the substitution logic is a convolutional neural network, during the training process, the weights of the convolutional filters are optimized to detect local spatial patterns that best capture the local biochemical features to separate the 20 amino acid microenvironments. After the training process, filters in convolution layers of the convolutional neural network are activated when the desired features are present at some spatial position in the input.

The structural (or spatial) compatibility can be defined by changes to or impact on protein functionality. When a substitute amino acid, after substitution at a specific location within a protein structure, causes changes in the functionality of a protein, then the substitute amino acid is considered structurally (or spatially) incompatible. When a substitute amino acid, after substitution at the specific location within the protein structure, does not cause changes in the functionality of a protein, then the substitute amino acid is considered structurally (or spatially) compatible.

The structural (or spatial) compatibility can also be defined by a spatial deviation measured by a distance metric. First, a pre-insertion spatial measurement of a protein structure can be determined, for example, by measuring distances between amino acids in the protein structure prior to the amino acid substitution at a particular position. The distances can be atomic distances based on atomic coordinates of the atoms of the amino acids. The distances can be measured between pairs of amino acids. Then, a post-insertion spatial measurement of the protein structure can be determined, for example, by remeasuring the distances between the amino acids in the protein structure after the amino acid substitution at the particular position. When the spatial deviation between the pre-insertion spatial measurement and the post-insertion spatial measurement exceeds a threshold, then the substitute amino acid is considered structurally (or spatially) incompatible. When the spatial deviation between the pre-insertion spatial measurement and the post-insertion spatial measurement does not exceed the threshold, then the substitute amino acid is considered structurally (or spatially) compatible.
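The distance-based compatibility test described above can be illustrated with a minimal sketch. The function names, the use of one representative 3D point per amino acid, the root-mean-square aggregation, and the threshold value are all assumptions made for illustration; the text does not specify a particular distance metric or threshold.

```python
import math
from itertools import combinations

def pairwise_distances(coords):
    """Distances between every pair of amino acids (one 3D point per amino acid)."""
    return [math.dist(a, b) for a, b in combinations(coords, 2)]

def spatial_deviation(pre_coords, post_coords):
    """Assumed metric: RMS difference between pre- and post-substitution distances."""
    pre = pairwise_distances(pre_coords)
    post = pairwise_distances(post_coords)
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(pre, post)) / len(pre))

def is_compatible(pre_coords, post_coords, threshold=0.5):
    """Compatible when the deviation does not exceed the (hypothetical) threshold."""
    return spatial_deviation(pre_coords, post_coords) <= threshold

# Toy coordinates: three amino acids before and after a substitution.
pre = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
post_small = [(0.0, 0.0, 0.0), (3.8, 0.1, 0.0), (7.6, 0.0, 0.0)]
post_large = [(0.0, 0.0, 0.0), (3.8, 3.0, 0.0), (7.6, 0.0, 0.0)]

print(is_compatible(pre, post_small))  # small shift stays under the threshold
print(is_compatible(pre, post_large))  # large shift exceeds the threshold
```

Any other distance metric or aggregation (for example, a maximum rather than a root mean square) could be substituted without changing the thresholding step.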

In another implementation, the technology disclosed relates to predicting evolutionary conservation of amino acid substitutes. In such an implementation, the technology disclosed includes a gapping logic and a substitution logic. The gapping logic is configured to remove, from a protein, a particular amino acid at a particular position, and create an amino acid vacancy at the particular position in the protein. The substitution logic is configured to process the protein with the amino acid vacancy, and score evolutionary conservation of substitute amino acids that are candidates for filling the amino acid vacancy. The substitution logic is further configured to score the evolutionary conservation of the substitute amino acids based at least in part on structural (or spatial) compatibility between the substitute amino acids and adjacent amino acids in a neighborhood of the amino acid vacancy (e.g., the right and left flanking amino acids). In some implementations, the evolutionary conservation is scored using evolutionary conservation frequencies. In one implementation, the evolutionary conservation frequencies are based on a position-specific frequency matrix (PSFM). In another implementation, the evolutionary conservation frequencies are based on a position-specific scoring matrix (PSSM). In one implementation, evolutionary conservation scores of the substitute amino acids are rank-ordered by magnitude.
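As a rough illustration of conservation frequencies and rank-ordering, the sketch below computes one PSFM column from a toy alignment column and ranks the amino acids by frequency. The unweighted counting, the helper names, and the example column are assumptions; a practical PSFM would typically apply sequence weighting.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def psfm_column(column):
    """Unweighted frequency of each amino acid in one alignment column.

    Characters outside the twenty amino acids (e.g., '-' gaps) are ignored.
    """
    counts = Counter(c for c in column if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}

def rank_by_conservation(freqs):
    """Amino acids ordered by decreasing frequency; ties broken alphabetically."""
    return sorted(freqs, key=lambda aa: (-freqs[aa], aa))

# Hypothetical alignment column at one position across ten homologous sequences.
column = "DDDDEDDN-D"
freqs = psfm_column(column)
ranking = rank_by_conservation(freqs)
print(ranking[:3])
```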

In yet another implementation, the technology disclosed relates to predicting evolutionary conservation of amino acid substitutes. In such an implementation, the technology disclosed includes a gapping logic and an evolutionary conservation prediction logic. The gapping logic is configured to remove, from a protein, a particular amino acid at a particular position, and create an amino acid vacancy at the particular position in the protein. The evolutionary conservation prediction logic is configured to process the protein with the amino acid vacancy, and rank evolutionary conservation of substitute amino acids that are candidates for filling the amino acid vacancy.

Gapped Protein Spatial Representation-Based Pathogenicity Determination for a Target Alternate Amino Acid

FIG. 37 illustrates one implementation of determining 3700 variant pathogenicity for a target alternate amino acid based on processing a gapped protein spatial representation. A protein is a sequence of amino acids. A particular amino acid in the protein that is removed or masked from the protein is called a “gap amino acid.” The resulting protein that lacks the gap amino acid is called a “gapped protein” or a “vacancy-containing protein.”

A “spatial representation” of a protein characterizes structural information about amino acids in the protein. The spatial representation of the protein can be based on shape, location, position, patterns, and/or arrangement of the amino acids in the protein. The spatial representation of the protein can be one-dimensional (1D), two-dimensional (2D), three-dimensional (3D), or n-dimensional (nD) information.

In one implementation, the spatial representation of the protein includes the amino acid-wise distance channels discussed above, for example, the amino acid-wise distance channels 600 described above with respect to FIG. 6. In another implementation, the spatial representation of the protein includes the distance channel tensor discussed above, for example, the distance channel tensor 700 described above with respect to FIG. 7. In yet another implementation, the spatial representation of the protein includes the evolutionary profiles tensor discussed above, for example, the evolutionary profiles tensor 1800 described above with respect to FIG. 18. In yet another implementation, the spatial representation of the protein includes the voxelized annotation channels discussed above, for example, the voxelized annotation channels 2000 described above with respect to FIG. 20. In yet another implementation, the spatial representation of the protein includes the structure confidence channels discussed above. In other implementations, the spatial representation can include other channels as well.

A “gapped spatial representation” of a protein is such a spatial representation of the protein that excludes at least one gap amino acid in the protein. In one implementation, a gap amino acid is excluded by excluding (or not considering or ignoring) one or more atoms or atom-types of the gap amino acid when generating the gapped spatial representation. For example, the atoms of the gap amino acid can be excluded from the calculations (or selections or computations) that produce the distance channels, the evolutionary profiles, the annotation channels, and/or the structure confidence channels. In other implementations, the gapped spatial representation can be generated by excluding the gap amino acid from other feature channels as well.

Consider the following example of generating a gapped spatial representation of a protein by excluding atoms of a gap amino acid from calculations of the amino acid-wise distance channels. In FIG. 5, the Cα^(A5) atom belongs to the Alanine amino acid at position five in the protein. Now assume that this Alanine amino acid at the fifth position is selected as the gap amino acid. Then the gapped spatial representation is generated by calculating the distance channel by not accounting for the distance 512 between the center of voxel (1, 1) of voxel grid 522 and the nearest alpha-carbon (C_(α)) atom, which is the Cα^(A5) atom of the gap amino acid, i.e., the Alanine amino acid at the fifth position.

Also note that this Application uses “spatial representation of a protein” and “protein structure” interchangeably. Also note that this Application uses “gapped spatial representation of a protein” and “gapped protein structure” interchangeably.

Turning to FIG. 37, at action 3702, a protein sequence accessor 3704 accesses a protein that has respective amino acids at respective positions.

At action 3712, a gap amino acid specifier 3714 specifies a particular amino acid at a particular position in the protein as a gap amino acid, and specifies remaining amino acids at remaining positions in the protein as non-gap amino acids. In one implementation, the particular amino acid is a reference amino acid that is a major allele of the protein.

At action 3722, a gapped spatial representation generator 3724 generates a gapped spatial representation of the protein that includes spatial configurations of the non-gap amino acids, and excludes a spatial configuration of the gap amino acid. The spatial configurations of the non-gap amino acids are encoded as amino acid class-wise distance channels. Each of the amino acid class-wise distance channels has voxel-wise distance values for voxels in a plurality of voxels. The voxel-wise distance values specify distances from corresponding voxels in the plurality of voxels to atoms of the non-gap amino acids. The spatial configurations of the non-gap amino acids are determined based on spatial proximity between the corresponding voxels and the atoms of the non-gap amino acids. The spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding distances from the corresponding voxels to atoms of the gap amino acid when determining the voxel-wise distance values. The spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding spatial proximity between the corresponding voxels and the atoms of the gap amino acid.
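A minimal sketch of a gapped amino acid class-wise distance channel follows. It assumes each atom is a dictionary with hypothetical keys `aa`, `position`, and `xyz`, and it uses infinity as a placeholder when no atom of the class remains; neither the data layout nor the placeholder comes from the text.

```python
import math

def distance_channel(voxel_centers, atoms, aa_class, gap_position=None):
    """For each voxel, distance to the nearest atom of `aa_class`.

    Atoms belonging to the amino acid at `gap_position` are disregarded,
    which is what makes the resulting channel "gapped".
    """
    values = []
    for center in voxel_centers:
        dists = [
            math.dist(center, atom["xyz"])
            for atom in atoms
            if atom["aa"] == aa_class and atom["position"] != gap_position
        ]
        # Assumed placeholder when the class has no remaining atoms.
        values.append(min(dists) if dists else float("inf"))
    return values

# Toy data: alpha-carbon atoms of two Alanines, at positions 5 and 9.
atoms = [
    {"aa": "A", "position": 5, "xyz": (1.0, 0.0, 0.0)},
    {"aa": "A", "position": 9, "xyz": (4.0, 0.0, 0.0)},
]
voxels = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]

print(distance_channel(voxels, atoms, "A"))                  # non-gapped
print(distance_channel(voxels, atoms, "A", gap_position=5))  # Alanine 5 gapped
```

With the Alanine at position five treated as the gap amino acid, the first voxel falls back to the next-nearest Alanine atom, mirroring the FIG. 5 example in which the distance to the Cα^(A5) atom is not accounted for.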

The spatial configurations of the non-gap amino acids are encoded as evolutionary profile channels based on pan-amino acid conservation frequencies of amino acids with nearest atoms to the voxels. In one implementation, the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding nearest atoms of the gap amino acid when determining the pan-amino acid conservation frequencies. The spatial configurations of the non-gap amino acids are encoded as evolutionary profile channels based on per-amino acid conservation frequencies of respective amino acids with respective nearest atoms to the voxels. In one implementation, the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding respective nearest atoms of the gap amino acid when determining the per-amino acid conservation frequencies. The spatial configurations of the non-gap amino acids are encoded as annotation channels. In one implementation, the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding atoms of the gap amino acid when determining the annotation channels. The spatial configurations of the non-gap amino acids are encoded as structural confidence channels. In one implementation, the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding atoms of the gap amino acid when determining the structural confidence channels. The spatial configurations of the non-gap amino acids are encoded as additional input channels. In one implementation, the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding atoms of the gap amino acid when determining the additional input channels.

At action 3732, a pathogenicity determiner 3734 determines a pathogenicity of a nucleotide variant based at least in part on the gapped spatial representation, and a representation of an alternate amino acid created by the nucleotide variant at the particular position. The representation of the alternate amino acid can be a one-hot encoding of the alternate amino acid (e.g., see FIG. 8). In some implementations, the alternate amino acid is an amino acid that is the same as the reference amino acid. In other implementations, the alternate amino acid is an amino acid that is different from the reference amino acid.

FIG. 38 shows an example of a spatial representation 3800 of a protein. The protein contains an amino acid sequence 3804. An Aspartic acid (D) amino acid at a 22^(nd) position in the amino acid sequence 3804 is selected as a gap amino acid 3802. FIG. 39 shows an example of a gapped spatial representation 3900 of the protein illustrated in FIG. 38. In FIG. 39, the gap amino acid 3802 is removed from the gapped spatial representation 3900. Also in FIG. 39, the absence of the gap amino acid 3802 is illustrated as a missing gap amino acid 3902.

FIG. 40 shows an example of an atomic spatial representation 4000 of the protein illustrated in FIG. 38. FIG. 40 also depicts atoms 4002 of the gap amino acid 3802. FIG. 41 shows an example of a gapped atomic spatial representation 4100 of the protein illustrated in FIG. 38. In FIG. 41, the atoms 4002 of the gap amino acid 3802 are removed from the gapped atomic spatial representation 4100. Also in FIG. 41, the absence of the atoms 4002 of the gap amino acid 3802 is illustrated as missing atoms 4102 of the gap amino acid 3802.

Also note that this Application uses “pathogenicity determiner,” “pathogenicity predictor,” “pathogenicity classifier,” “variant pathogenicity classifier,” “evolutionary conservation predictor,” and “evolutionary conservation determiner” interchangeably.

FIG. 42 illustrates one implementation of a pathogenicity classifier 2108/2600/2700 determining 4200 variant pathogenicity for a target alternate amino acid based on processing a gapped protein spatial representation 4202 and an alternate amino acid representation 4212 of the target alternate amino acid.

The pathogenicity classifier 2108/2600/2700 determines the pathogenicity of the nucleotide variant by processing, as input, the gapped spatial representation 4202, and the representation of the alternate amino acid 4212, and generating, as output, a pathogenicity score 4208 for the alternate amino acid.

FIG. 43 depicts one implementation of training data 4300 used to train the pathogenicity classifier 2108/2600/2700. The pathogenicity classifier 2108/2600/2700 is trained on a benign training set 4302. The benign training set 4302 has respective benign protein samples 4322, 4342, and 4362 for respective reference amino acids at respective positions 4312, 4332, and 4352 in a proteome. The reference amino acids are major allele amino acids of the proteome. In one implementation, the proteome has ten million positions, and therefore the benign training set 4302 has ten million benign protein samples. The respective benign protein samples have respective gapped spatial representations generated by using the respective reference amino acids as respective gap amino acids. The respective benign protein samples have respective representations of the respective reference amino acids as respective alternate amino acids. In various implementations, the proteome includes human proteome and non-human proteome, including non-human primate proteome.

FIG. 44 illustrates one implementation of generating 4400 gapped spatial representations 4322G, 4342G, and 4362G for reference protein samples 4322, 4342, and 4362 by using reference amino acids 4402, 4412, and 4422 as gap amino acids, respectively. FIG. 45 shows one implementation of training the pathogenicity classifier 2108/2600/2700 on benign protein samples 4500.

The pathogenicity classifier 2108/2600/2700 trains on a particular benign protein sample and estimates a pathogenicity of a particular reference amino acid at a particular position in the particular benign protein sample by processing, as input, (i) a particular gapped spatial representation 4322G of the particular benign protein sample, and (ii) a representation 4402 (e.g., a one-hot encoding) of the particular reference amino acid as a particular alternate amino acid, and generating, as output, a pathogenicity score for the particular reference amino acid. The particular gapped spatial representation is generated by using the particular reference amino acid as a gap amino acid, and by using remaining amino acids at remaining positions in the particular benign protein sample as non-gap amino acids.

Each of the benign protein samples has a ground truth benignness label 4506 that indicates absolute benignness of the benign protein samples. In one implementation, the ground truth benignness label is zero, one, or minus one. The pathogenicity score 4502 for the particular reference amino acid is compared against the ground truth benignness label to determine an error 4504, and to improve coefficients of the pathogenicity classifier 2108/2600/2700 based on the error using a training technique (e.g., backpropagation 4512).

The pathogenicity classifier 2108/2600/2700 is trained on a pathogenic training set 4308. The pathogenic training set 4308 has respective pathogenic protein samples 4322A-N, 4342A-N, and 4362A-N for respective combinatorically generated amino acid substitutions for each of the reference amino acids 4312, 4332, and 4352 at each of the respective positions 4318, 4338, and 4358 in the proteome. In one implementation, the respective combinatorically generated amino acid substitutions are confined by reachability of single nucleotide polymorphisms (SNPs) to transform a reference codon of a reference amino acid into alternate amino acids of unreachable alternate amino acid classes. The combinatorically generated amino acid substitutions for a particular reference amino acid of a particular amino acid class at a particular position in the proteome include respective alternate amino acids of respective amino acid classes that are different from the particular amino acid class.

In one implementation, the proteome has the ten million positions, wherein there are nineteen combinatorically generated amino acid substitutions for each of the ten million positions, and therefore the pathogenic training set 4308 has one hundred and ninety million pathogenic protein samples.

The respective pathogenic protein samples have respective gapped spatial representations generated by using the respective reference amino acids as respective gap amino acids. The respective pathogenic protein samples have respective representations of the respective combinatorically generated amino acid substitutions as respective alternate amino acids created by respective combinatorically generated nucleotide variants at the respective positions in the proteome.
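The construction of the benign and pathogenic training sets can be sketched as follows. For each position, the reference amino acid yields one benign sample, and the nineteen other amino acid classes yield pathogenic samples. The dictionary layout, the field names, and the choice of labels 0 and 1 are illustrative assumptions; the gapped spatial representation itself is stood in for by the gap position.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(aa):
    """One-hot encoding of an amino acid over the twenty classes."""
    return [1 if aa == c else 0 for c in AMINO_ACIDS]

def training_samples(sequence):
    """Yield one benign and nineteen pathogenic samples per position."""
    for position, ref in enumerate(sequence):
        for alt in AMINO_ACIDS:
            yield {
                "gap_position": position,       # reference amino acid is the gap
                "ref": ref,
                "alt": alt,
                "alt_one_hot": one_hot(alt),
                "label": 0 if alt == ref else 1,  # assumed: 0 benign, 1 pathogenic
            }

# Toy three-position "proteome".
samples = list(training_samples("MAD"))
benign = [s for s in samples if s["label"] == 0]
pathogenic = [s for s in samples if s["label"] == 1]
print(len(benign), len(pathogenic))
```

Scaled to ten million positions, the same one-to-nineteen ratio gives the ten million benign and one hundred and ninety million pathogenic samples stated in the text.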

FIG. 46 shows one implementation of training the pathogenicity classifier 2108/2600/2700 on pathogenic protein samples 4600. The pathogenicity classifier 2108/2600/2700 trains on a particular pathogenic protein sample and estimates a pathogenicity of a particular combinatorically generated amino acid substitution for a particular reference amino acid at a particular position in the particular pathogenic protein sample by processing, as input, (i) a particular gapped spatial representation 4322G of the particular pathogenic protein sample, and (ii) a representation 4622 (e.g., a one-hot encoding) of the particular combinatorically generated amino acid substitution as a particular alternate amino acid, and generating, as output, a pathogenicity score for the particular combinatorically generated amino acid substitution. The particular gapped spatial representation is generated by using the particular reference amino acid as a gap amino acid, and by using remaining amino acids at remaining positions in the particular pathogenic protein sample as non-gap amino acids.

Each of the pathogenic protein samples has a ground truth pathogenicity label that indicates absolute pathogenicity of the pathogenic protein samples. In one implementation, the ground truth pathogenicity label is one, zero, or minus one, as long as it is different from (e.g., opposite to) the ground truth benignness label. The pathogenicity score 4602 for the particular combinatorically generated amino acid substitution is compared against the ground truth pathogenicity label 4606 to determine an error 4604, and to improve the coefficients of the pathogenicity classifier 2108/2600/2700 based on the error using the training technique (e.g., backpropagation 4612).

In one implementation, the pathogenicity classifier 2108/2600/2700 is trained on two hundred million training iterations. In such an implementation, the two hundred million training iterations include ten million training iterations with the ten million benign protein samples, and one hundred and ninety million iterations with the one hundred and ninety million pathogenic protein samples. In one implementation, the proteome has one million to ten million positions, and therefore the benign training set has one million to ten million benign protein samples. In such an implementation, there are nineteen combinatorically generated amino acid substitutions for each of the one million to ten million positions, and therefore the pathogenic training set has nineteen million to one hundred and ninety million pathogenic protein samples.

In one implementation, the pathogenicity classifier 2108/2600/2700 is trained on twenty million to two hundred million training iterations. In such an implementation, the twenty million to two hundred million training iterations include one million to ten million training iterations with the one million to ten million benign protein samples, and nineteen million to one hundred and ninety million iterations with the nineteen million to one hundred and ninety million pathogenic protein samples.

FIG. 47 shows how certain unreachable amino acid classes are masked 4700 during training. At action 4702, those unreachable alternate amino acid classes that are confined by reachability of single nucleotide polymorphisms (SNPs) to transform a reference codon of a reference amino acid into alternate amino acids of the unreachable alternate amino acid classes are masked in ground truth labels. At action 4712, the masked amino acid classes result in zero loss and do not contribute to gradient updates. At action 4722, the masked amino acid classes are identified in a lookup table. At action 4732, the lookup table identifies a set of masked amino acid classes for each reference amino acid position.
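One way to picture the masking is sketched below: the amino acid classes reachable from a reference codon by a single nucleotide change are enumerated using the standard genetic code, and every other alternate class is masked out of the loss. The function names, the mean-squared-error loss, and keying the mask by codon are assumptions for illustration; the text only states that masked classes yield zero loss and are identified in a lookup table.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AA_BY_CODON = dict(zip(
    ("".join(c) for c in product(BASES, repeat=3)),
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
))
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def reachable_classes(ref_codon):
    """Alternate amino acid classes reachable by one nucleotide substitution."""
    ref_aa = AA_BY_CODON[ref_codon]
    reachable = set()
    for i, base in enumerate(ref_codon):
        for alt_base in BASES:
            if alt_base == base:
                continue
            aa = AA_BY_CODON[ref_codon[:i] + alt_base + ref_codon[i + 1:]]
            if aa != "*" and aa != ref_aa:  # skip stop codons and synonymous changes
                reachable.add(aa)
    return reachable

def loss_mask(ref_codon):
    """1 for classes that contribute to the loss, 0 for masked (unreachable) classes."""
    keep = reachable_classes(ref_codon) | {AA_BY_CODON[ref_codon]}
    return [1 if aa in keep else 0 for aa in AMINO_ACIDS]

def masked_loss(scores, labels, mask):
    """Assumed loss: mean squared error over unmasked classes only."""
    terms = [(s - y) ** 2 for s, y, m in zip(scores, labels, mask) if m]
    return sum(terms) / len(terms)

# Hypothetical lookup table keyed by reference codon.
MASK_LOOKUP = {codon: loss_mask(codon) for codon, aa in AA_BY_CODON.items() if aa != "*"}
print(sorted(reachable_classes("ATG")))
```

Because masked classes are dropped before the loss is averaged, their scores have no effect on the loss value and therefore contribute nothing to gradient updates.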

FIG. 48 illustrates one implementation of determining a final pathogenicity score. At action 4802, in one implementation, the pathogenicity classifier 2108/2600/2700 generates a first pathogenicity score for a first alternate amino acid that is the same as a first reference amino acid. At action 4812, in one implementation, the pathogenicity classifier 2108/2600/2700 generates a second pathogenicity score for a second alternate amino acid that is different from the first reference amino acid. At action 4822, in one implementation, a final pathogenicity score for the second alternate amino acid is the second pathogenicity score for the second alternate amino acid.

In other alternatives, the final pathogenicity score for the second alternate amino acid is based on a combination of the first pathogenicity score and the second pathogenicity score. In a first alternative at 4822a, in one implementation, the final pathogenicity score for the second alternate amino acid is a ratio of the second pathogenicity score over a sum of the first pathogenicity score and the second pathogenicity score. In a second alternative at 4822b, in one implementation, the final pathogenicity score for the second alternate amino acid is determined by subtracting the first pathogenicity score from the second pathogenicity score.
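The three ways of deriving the final pathogenicity score can be written compactly. The function name and the `mode` strings are hypothetical labels for the options at 4822, 4822a, and 4822b.

```python
def final_score(ref_score, alt_score, mode="alt_only"):
    """Combine the reference (first) and alternate (second) pathogenicity scores."""
    if mode == "alt_only":      # action 4822: use the alternate score directly
        return alt_score
    if mode == "ratio":         # 4822a: alt / (ref + alt)
        return alt_score / (ref_score + alt_score)
    if mode == "difference":    # 4822b: alt - ref
        return alt_score - ref_score
    raise ValueError(f"unknown mode: {mode}")

for mode in ("alt_only", "ratio", "difference"):
    print(mode, final_score(0.2, 0.6, mode))
```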

The discussion so far covered what is depicted in FIG. 49A. FIG. 49A shows that a variant pathogenicity determination is made for a target alternate amino acid 4922 filling a vacancy created by a reference gap amino acid 4902 at a given position in a protein 4912. In particular, this analysis is done by spatially representing the protein 4912 and the vacancy in a 3D format, for example, by using voxelized amino acid category-wise distance calculations that exclude the reference gap amino acid 4902 (or atoms thereof).

The discussion now turns to FIG. 49B. FIG. 49B shows that respective variant pathogenicity determinations are made for amino acids of respective amino acid classes 4916 filling the vacancy created by the reference gap amino acid 4902 at the given position in the protein 4912. The inputs in FIGS. 49A and 49B are the same; only the output is different, and so are the spatial representations of the protein 4912 and the vacancy in the 3D format. In FIG. 49A only one pathogenicity score is generated; whereas in FIG. 49B a pathogenicity score is generated for each of the twenty amino acid classes/categories (e.g., by using a 20-way softmax classification).

Gapped Protein Spatial Representation-Based Pathogenicity Determination for Multiple Alternate Amino Acids

FIG. 50 illustrates one implementation of determining 5000 variant pathogenicity for multiple alternate amino acids based on processing a gapped protein spatial representation. At action 5002, the protein sequence accessor 3704 accesses a protein that has respective amino acids at respective positions.

At action 5012, the gap amino acid specifier 3714 specifies a particular amino acid at a particular position in the protein as a gap amino acid, and specifies remaining amino acids at remaining positions in the protein as non-gap amino acids. In one implementation, the particular amino acid is a reference amino acid that is a major allele of the protein.

At action 5022, the gapped spatial representation generator 3724 generates a gapped spatial representation of the protein that includes spatial configurations of the non-gap amino acids, and excludes a spatial configuration of the gap amino acid. The spatial configurations of the non-gap amino acids are encoded as amino acid class-wise distance channels. Each of the amino acid class-wise distance channels has voxel-wise distance values for voxels in a plurality of voxels. The voxel-wise distance values specify distances from corresponding voxels in the plurality of voxels to atoms of the non-gap amino acids. The spatial configurations of the non-gap amino acids are determined based on spatial proximity between the corresponding voxels and the atoms of the non-gap amino acids. The spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding distances from the corresponding voxels to atoms of the gap amino acid when determining the voxel-wise distance values. The spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding spatial proximity between the corresponding voxels and the atoms of the gap amino acid.

The spatial configurations of the non-gap amino acids are encoded as evolutionary profile channels based on pan-amino acid conservation frequencies of amino acids with nearest atoms to the voxels. In one implementation, the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding nearest atoms of the gap amino acid when determining the pan-amino acid conservation frequencies. The spatial configurations of the non-gap amino acids are encoded as evolutionary profile channels based on per-amino acid conservation frequencies of respective amino acids with respective nearest atoms to the voxels. In one implementation, the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding respective nearest atoms of the gap amino acid when determining the per-amino acid conservation frequencies. The spatial configurations of the non-gap amino acids are encoded as annotation channels. In one implementation, the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding atoms of the gap amino acid when determining the annotation channels. The spatial configurations of the non-gap amino acids are encoded as structural confidence channels. In one implementation, the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding atoms of the gap amino acid when determining the structural confidence channels. The spatial configurations of the non-gap amino acids are encoded as additional input channels. In one implementation, the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding atoms of the gap amino acid when determining the additional input channels.

At action 5032, the pathogenicity determiner 3734 determines, based at least in part on the gapped spatial representation, a pathogenicity of respective alternate amino acids at the particular position. The respective alternate amino acids are respective combinatorically generated alternate amino acids created by respective combinatorically generated nucleotide variants at the particular position.

FIG. 51 illustrates one implementation of the pathogenicity classifier 2108/2600/2700 determining 5100 variant pathogenicity for multiple alternate amino acids based on processing a gapped protein spatial representation 5102. The pathogenicity classifier 2108/2600/2700 determines the pathogenicity of the respective alternate amino acids by processing, as input, the gapped spatial representation 5102, and generating, as output, respective pathogenicity scores 1-20 for respective amino acid classes. In some implementations, the respective amino acid classes correspond to respective twenty naturally-occurring amino acids. In other implementations, the respective amino acid classes correspond to respective naturally-occurring amino acids from a subset of the twenty naturally-occurring amino acids. In one implementation, the output is displayed with the respective rankings of the respective pathogenicity scores 1-20 for respective amino acid classes.

FIG. 52 illustrates one implementation of concurrently training 5200 the pathogenicity classifier 2108/2600/2700 on benign and pathogenic protein samples. The pathogenicity classifier 2108/2600/2700 is trained on a training set. The training set has respective protein samples for respective positions in the proteome. The proteome has ten million positions, and therefore the training set has ten million protein samples. The respective protein samples have respective gapped spatial representations generated by using respective reference amino acids at the respective positions in the proteome as respective gap amino acids. The reference amino acids are major allele amino acids of the proteome.

The pathogenicity classifier 2108/2600/2700 trains on a particular protein sample and estimates a pathogenicity of respective alternate amino acids for a particular reference amino acid at a particular position in the particular protein sample by processing, as input, a particular gapped spatial representation 5202 of the particular protein sample, and generating, as output, respective pathogenicity scores 1-20 for the respective amino acid classes. The particular gapped spatial representation is generated by using the particular reference amino acid as a gap amino acid, and by using remaining amino acids at remaining positions in the particular protein sample as non-gap amino acids.

Each of the protein samples has respective ground truth labels for the respective amino acid classes. The respective ground truth labels include an absolute benignness label for a reference amino acid class in the respective amino acid classes, and include respective absolute pathogenicity labels for respective alternate amino acid classes in the respective amino acid classes. In one implementation, the absolute benignness label is zero. The absolute pathogenicity labels are the same across the respective alternate amino acid classes. In one implementation, the absolute pathogenicity labels are one.

In one implementation, an error 5204 is determined based on a comparison of a pathogenicity score for the reference amino acid class against the absolute benignness label (e.g., pathogenicity score 8 for reference gap amino acid 5212 in FIG. 52), and respective comparisons of respective pathogenicity scores for the respective alternate amino acid classes against the respective absolute pathogenicity labels (e.g., pathogenicity scores 1-7 and 9-20 in FIG. 52). In one implementation, coefficients of the pathogenicity classifier 2108/2600/2700 are improved based on the error using a training technique (e.g., backpropagation 5224).
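A minimal sketch of the per-sample error for the twenty-class output follows. It builds the ground truth label vector (zero for the reference class, one for every alternate class) and compares it with the twenty scores. The binary cross-entropy formulation, the clamping constant, and the function names are assumptions; the text does not name a specific loss function.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def ground_truth_labels(ref_aa, benign=0.0, pathogenic=1.0):
    """Benign label for the reference class, pathogenic label for all others."""
    return [benign if aa == ref_aa else pathogenic for aa in AMINO_ACIDS]

def sample_error(scores, labels, eps=1e-7):
    """Assumed error: mean binary cross-entropy across the twenty classes."""
    total = 0.0
    for s, y in zip(scores, labels):
        s = min(max(s, eps), 1.0 - eps)  # clamp to keep the logarithms finite
        total -= y * math.log(s) + (1.0 - y) * math.log(1.0 - s)
    return total / len(labels)

# Toy sample whose reference (gap) amino acid is Aspartic acid (D).
labels = ground_truth_labels("D")
good = [0.05 if aa == "D" else 0.95 for aa in AMINO_ACIDS]
bad = [0.95 if aa == "D" else 0.05 for aa in AMINO_ACIDS]
print(sample_error(good, labels) < sample_error(bad, labels))
```

In training, an error of this kind would be backpropagated to update the classifier coefficients; that step is omitted here.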

In one implementation, the pathogenicity classifier 2108/2600/2700 is trained on ten million training iterations with the ten million protein samples. In some implementations, the proteome has one million to ten million positions, and therefore the training set has one million to ten million protein samples. In one implementation, the pathogenicity classifier 2108/2600/2700 is trained on one million to ten million training iterations with the one million to ten million protein samples.

In one implementation, the pathogenicity classifier 2108/2600/2700 generates a reference pathogenicity score for a first alternate amino acid of the reference amino acid class. In one implementation, the pathogenicity classifier 2108/2600/2700 generates respective alternate pathogenicity scores for respective alternate amino acids of the respective alternate amino acid classes.

In one implementation, respective final alternate pathogenicity scores for the respective alternate amino acids are the respective alternate pathogenicity scores. In one implementation, respective final alternate pathogenicity scores for the respective alternate amino acids are based on respective combinations of the reference pathogenicity score and the respective alternate pathogenicity scores. In one implementation, respective final alternate pathogenicity scores for the respective alternate amino acids are respective ratios of the respective alternate pathogenicity scores over a sum of the reference pathogenicity score and the respective alternate pathogenicity scores. In one implementation, respective final alternate pathogenicity scores for the respective alternate amino acids are determined by respectively subtracting the reference pathogenicity score from the respective alternate pathogenicity scores.

In one implementation, the pathogenicity classifier 2108/2600/2700 has an output layer that generates the respective pathogenicity scores. In some implementations, the output layer is a normalization layer. In such implementations, the respective pathogenicity scores are normalized. In one implementation, the output layer is a softmax layer. In such an implementation, the respective pathogenicity scores are exponentially normalized. In another implementation, the output layer has respective sigmoid units that respectively generate the respective pathogenicity scores. In yet another implementation, the respective pathogenicity scores are unnormalized.
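The difference between the softmax and sigmoid output layers can be sketched as follows. This is an illustration on plain Python lists, not the classifier's actual output layer; the function names are hypothetical.

```python
import math

def softmax(logits):
    """Exponentially normalize scores so that they sum to one."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid_units(logits):
    """Score each amino acid class independently in the range (0, 1)."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]
```

With softmax, raising the score of one amino acid class lowers the others; with sigmoid units, each class is scored on its own.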

Gapped Protein Spatial Representation- and Evolutionary Conservation-Based Pathogenicity Determination for Multiple Alternate Amino Acids

Evolutionary conservation refers to the presence of similar genes, portions of genes, or chromosome segments in different species, reflecting both the common origin of species and an important functional property of the conserved element. Mutations occur spontaneously in each generation, randomly changing an amino acid here and there in a protein. Individuals with mutations that impair critical functions of proteins may have resulting problems that make them less able to reproduce. Harmful mutations are lost from the gene pool because the individuals carrying them reproduce less effectively. Since the harmful mutations are lost, the amino acids critical for the function of a protein are conserved in the gene pool. In contrast, harmless (or very rare beneficial) mutations are kept in the gene pool, producing variability in non-critical amino acids. Evolutionary conservation in proteins is identified by aligning the amino acid sequences of proteins with the same function from different taxa (orthologs). Predicting the functional consequences of variants relies at least in part on the assumption that crucial amino acids for protein families are conserved through evolution due to negative selection (i.e., amino acid changes at these sites were deleterious in the past), and that mutations at these sites have an increased likelihood of being pathogenic (causing disease) in humans. In general, homologous sequences of a target protein are collected and aligned, and a metric of conservation is computed based on the weighted frequencies of different amino acids observed in the target position in the alignment.

FIG. 53 illustrates one implementation of determining 5300 variant pathogenicity for multiple alternate amino acids based on processing a gapped protein spatial representation and, in response, generating evolutionary conservation scores for the multiple alternate amino acids.

At action 5302, the gap amino acid specifier 3714 specifies a particular amino acid at a particular position in a protein as a gap amino acid, and specifies remaining amino acids at remaining positions in the protein as non-gap amino acids. In one implementation, the particular amino acid is a reference amino acid that is a major allele of the protein.

At action 5312, the gapped spatial representation generator 3724 generates a gapped spatial representation of the protein that includes spatial configurations of the non-gap amino acids, and excludes a spatial configuration of the gap amino acid. The spatial configurations of the non-gap amino acids are encoded as amino acid class-wise distance channels. Each of the amino acid class-wise distance channels has voxel-wise distance values for voxels in a plurality of voxels. The voxel-wise distance values specify distances from corresponding voxels in the plurality of voxels to atoms of the non-gap amino acids. The spatial configurations of the non-gap amino acids are determined based on spatial proximity between the corresponding voxels and the atoms of the non-gap amino acids. The spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding distances from the corresponding voxels to atoms of the gap amino acid when determining the voxel-wise distance values. The spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding spatial proximity between the corresponding voxels and the atoms of the gap amino acid.
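The exclusion of the gap amino acid from the distance channels can be sketched as follows. The sketch assumes that each voxel-wise distance value is the distance from the voxel center to the nearest atom of the given amino acid class, and that a class with no remaining atoms is recorded as infinity; the function name, the atom tuple layout, and these conventions are assumptions made for illustration.

```python
import math

def gapped_distance_channels(atoms, voxel_centers, classes, gap_position):
    """Build one nearest-atom distance channel per amino acid class.

    atoms: list of (residue_position, amino_acid_class, (x, y, z)).
    Atoms of the residue at `gap_position` are disregarded.
    """
    channels = {c: [math.inf] * len(voxel_centers) for c in classes}
    for position, aa_class, xyz in atoms:
        if position == gap_position:
            continue  # the gap amino acid contributes no distances
        for i, center in enumerate(voxel_centers):
            d = math.dist(center, xyz)
            if d < channels[aa_class][i]:
                channels[aa_class][i] = d
    return channels
```

Setting `gap_position` to a position that does not occur in `atoms` yields the corresponding non-gapped channels under the same assumptions.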

The spatial configurations of the non-gap amino acids are encoded as evolutionary profile channels based on pan-amino acid conservation frequencies of amino acids with nearest atoms to the voxels. In one implementation, the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding nearest atoms of the gap amino acid when determining the pan-amino acid conservation frequencies. The spatial configurations of the non-gap amino acids are encoded as evolutionary profile channels based on per-amino acid conservation frequencies of respective amino acids with respective nearest atoms to the voxels. In one implementation, the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding respective nearest atoms of the gap amino acid when determining the per-amino acid conservation frequencies. The spatial configurations of the non-gap amino acids are encoded as annotation channels. In one implementation, the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding atoms of the gap amino acid when determining the annotation channels. The spatial configurations of the non-gap amino acids are encoded as structural confidence channels. In one implementation, the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding atoms of the gap amino acid when determining the structural confidence channels. The spatial configurations of the non-gap amino acids are encoded as additional input channels. In one implementation, the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding atoms of the gap amino acid when determining the additional input channels.

At action 5322, an evolutionary conservation determiner 5324 determines an evolutionary conservation at the particular position of respective amino acids of respective amino acid classes based at least in part on the gapped spatial representation.

FIG. 54 shows the evolutionary conservation determiner 5324 in operation 5400, in accordance with one implementation. The evolutionary conservation determiner 5324, in some implementations, has the same architecture as the pathogenicity classifier 2108/2600/2700. The evolutionary conservation determiner 5324 determines the evolutionary conservation by processing, as input, the gapped spatial representation 5402, and generating, as output, respective evolutionary conservation scores 5406 for the respective amino acids 5408. The respective evolutionary conservation scores are rankable by magnitude. For purposes of the present disclosure, a “classifier”, “determiner”, “insert term here” can include one or more software modules, one or more hardware modules, or any combination thereof.

At action 5332, the pathogenicity determiner 3734, based at least in part on the evolutionary conservation of the respective amino acids 5408, determines a pathogenicity of respective nucleotide variants that respectively substitute the particular amino acid with the respective amino acids 5408 in alternate representations of the protein.

FIG. 55 illustrates one implementation of determining pathogenicity based on predicted evolutionary scores. A classifier 5516 classifies a nucleotide variant as pathogenic 5508 when an evolutionary conservation score generated by the evolutionary conservation determiner 5324 for a corresponding amino acid substitution is below a threshold. In one implementation, the classifier 5516 classifies a nucleotide variant as pathogenic 5508 when an evolutionary conservation score generated by the evolutionary conservation determiner 5324 for a corresponding amino acid substitution is zero (i.e., indication of non-conservation).

The classifier 5516 classifies a nucleotide variant as benign 5528 when an evolutionary conservation score generated by the evolutionary conservation determiner 5324 for a corresponding amino acid substitution is above a threshold. In one implementation, the classifier 5516 classifies a nucleotide variant as benign 5528 when an evolutionary conservation score generated by the evolutionary conservation determiner 5324 for a corresponding amino acid substitution is non-zero (i.e., indication of conservation).
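The thresholding performed by the classifier 5516 can be sketched as follows. The function name and the default threshold of zero are assumptions; the text does not state how a score exactly equal to the threshold is handled, so this sketch treats it as pathogenic, which matches the zero/non-zero implementation for non-negative scores.

```python
def classify_variant(conservation_score, threshold=0.0):
    """Label a variant from the conservation score of its substitution.

    Scores above the threshold are treated as conserved (benign);
    all other scores are treated as non-conserved (pathogenic).
    """
    return "benign" if conservation_score > threshold else "pathogenic"
```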

FIG. 56 illustrates one implementation of training data 5600 used to train the evolutionary conservation determiner 5324. The evolutionary conservation determiner 5324 is trained on a conserved training set and a non-conserved training set. The conserved training set has respective conserved protein samples 5602 for respective conserved amino acids at respective positions in a proteome. The non-conserved training set has respective non-conserved (or unconserved) protein samples 5608 for respective non-conserved amino acids at the respective positions. In various implementations, the proteome includes human proteome and non-human proteome, including non-human primate proteome.

Each of the respective positions has a set of conserved amino acids and a set of non-conserved amino acids. A particular set of conserved amino acids for a particular position in a particular protein in the proteome includes at least one major allele amino acid observed at the particular position across a plurality of species. In one implementation, the major allele amino acid is a reference amino acid (e.g., REF allele 5612 spanning benign protein sample 5622, and REF allele 5662 spanning benign protein sample 5682). The particular set of conserved amino acids includes one or more minor allele amino acids observed at the particular position across the plurality of species (e.g., observed ALT alleles 5632 spanning benign protein samples 5642, 5652, 5662, and observed ALT alleles 5692 spanning benign protein samples 5695, 5696).

A particular set of non-conserved amino acids for the particular position includes amino acids not in the particular set of conserved amino acids (e.g., unobserved ALT alleles 5618 spanning pathogenic protein samples 5622A-N, and unobserved ALT alleles 5668 spanning pathogenic protein samples 5682A-N).

In one implementation, each of the respective positions has C conserved amino acids in the set of conserved amino acids. In such an implementation, each of the respective positions has NC non-conserved amino acids in the set of non-conserved amino acids, where NC=20−C. The conserved training set has CP conserved protein samples, where CP=a number of the respective positions*C. The non-conserved training set has NCP non-conserved protein samples, where NCP=the number of the respective positions*(20−C). In one implementation, the C ranges from one to ten. In another implementation, the C varies across the respective positions. In yet another implementation, the C is the same for some of the respective positions.

In one implementation, the proteome has one to ten million positions. In such an implementation, each of the one to ten million positions has the C conserved amino acids in the set of conserved amino acids. Each of the one to ten million positions has the NC non-conserved amino acids in the set of non-conserved amino acids, where NC=20−C. The conserved training set has the CP conserved protein samples, where CP=one to ten million*C. The non-conserved training set has the NCP non-conserved protein samples, where NCP=one to ten million*(20−C).

In one implementation, the evolutionary conservation determiner 5324 is trained on twenty million to two hundred million training iterations. In such an implementation, the twenty million to two hundred million training iterations include one million to ten million training iterations with the one million to ten million conserved protein samples, and nineteen million to one hundred and ninety million iterations with the nineteen million to one hundred and ninety million non-conserved protein samples.
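The sample counts above follow directly from the formulas CP=positions*C and NCP=positions*(20−C). A short sketch, with a hypothetical function name and the simplifying assumption that C is the same at every position, shows that C=1 reproduces the one million to ten million conserved samples and the nineteen million to one hundred and ninety million non-conserved samples.

```python
def training_set_sizes(num_positions, conserved_per_position):
    """Return (CP, NCP) for a fixed C at every position (an assumption)."""
    c = conserved_per_position
    if not 1 <= c <= 20:
        raise ValueError("C must be between 1 and 20")
    cp = num_positions * c           # conserved protein samples
    ncp = num_positions * (20 - c)   # non-conserved protein samples
    return cp, ncp
```

For ten million positions and C=1, the two counts sum to two hundred million, which matches the upper bound on training iterations.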

In another implementation, the proteome has one million to ten million positions, and therefore the training set has one million to ten million protein samples. In such an implementation, the evolutionary conservation determiner 5324 is trained on one million to ten million training iterations with the one million to ten million protein samples.

The respective conserved and non-conserved protein samples have respective gapped spatial representations generated by using respective reference amino acids at the respective positions as respective gap amino acids. The evolutionary conservation determiner 5324 trains on a particular conserved protein sample and estimates an evolutionary conservation of a particular conserved amino acid at a particular position in the particular conserved protein sample by processing, as input, a particular gapped spatial representation of the particular conserved protein sample, and generating, as output, an evolutionary conservation score for the particular conserved amino acid. The particular gapped spatial representation is generated by using a particular reference amino acid at the particular position as a gap amino acid, and by using remaining amino acids at remaining positions in the particular conserved protein sample as non-gap amino acids.

Each of the conserved protein samples has a ground truth conserved label. The ground truth conserved label is an evolutionary conservation frequency. In one implementation, the ground truth conserved label is one. The evolutionary conservation score for the particular conserved amino acid is compared against the ground truth conserved label to determine an error, and to improve coefficients of the evolutionary conservation determiner 5324 based on the error using a training technique. In one implementation, the training technique is a loss function-based gradient update technique (e.g., backpropagation).

In some implementations, the ground truth conserved label is masked and not used to determine the error when the particular conserved amino acid is the particular reference amino acid. In such implementations, the masking causes the evolutionary conservation determiner 5324 to not overfit on the particular reference amino acid.

The evolutionary conservation determiner 5324 trains on a particular non-conserved protein sample and estimates an evolutionary conservation of a particular non-conserved amino acid at a particular position in the particular non-conserved protein sample by processing, as input, a particular gapped spatial representation of the particular non-conserved protein sample, and generating, as output, an evolutionary conservation score for the particular non-conserved amino acid. The particular gapped spatial representation is generated by using a particular reference amino acid at the particular position as a gap amino acid, and by using remaining amino acids at remaining positions in the particular non-conserved protein sample as non-gap amino acids.

Each of the non-conserved protein samples has a ground truth non-conserved label. The ground truth non-conserved label is an evolutionary conservation frequency. In one implementation, the ground truth non-conserved label is zero. The evolutionary conservation score for the particular non-conserved amino acid is compared against the ground truth non-conserved label to determine an error, and to improve the coefficients of the evolutionary conservation determiner 5324 based on the error using the training technique (e.g., backpropagation).

The evolutionary conservation determiner 5324 is trained on a training set. The training set has respective protein samples for the respective positions in the proteome. The respective protein samples have respective gapped spatial representations generated by using the respective reference amino acids at the respective positions as the respective gap amino acids.

FIG. 57 illustrates one implementation of concurrently training 5700 the evolutionary conservation determiner on benign and pathogenic protein samples. The evolutionary conservation determiner 5324 trains on a particular protein sample and estimates an evolutionary conservation of respective amino acids of respective amino acid classes at a particular position in the particular protein sample by processing, as input, a particular gapped spatial representation 5722 of the particular protein sample, and generating, as output, respective evolutionary conservation scores 1-20 for the respective amino acids. The particular gapped spatial representation 5722 is generated by using a particular reference amino acid at the particular position as a gap amino acid, and by using remaining amino acids at remaining positions in the particular protein sample as non-gap amino acids.

Each of the protein samples has respective ground truth labels for the respective amino acids. The respective ground truth labels include one or more conserved (benign) labels for one or more conserved amino acids 5732, 5702, 5712, in the respective amino acids, and include one or more non-conserved (pathogenic) labels for one or more non-conserved amino acids in the respective amino acids. The conserved labels and the non-conserved labels have respective evolutionary conservation frequencies. The respective evolutionary conservation frequencies are rankable according to magnitude. In one implementation, the conserved labels are ones, and the non-conserved labels are zeros.

In one implementation, an error 5704 is determined based on respective comparisons of respective evolutionary conservation scores for the respective conserved amino acids against the respective conserved labels, and respective comparisons of respective evolutionary conservation scores for the respective non-conserved amino acids against the respective non-conserved labels. The coefficients of the evolutionary conservation determiner 5324 are improved based on the error using the training technique (e.g., backpropagation 5744).

In one implementation, the conserved amino acids include the particular reference amino acid, and a conserved label for the particular reference amino acid is masked and not used to determine the error. The masking causes the evolutionary conservation determiner 5324 to not overfit on the particular reference amino acid.
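The effect of masking a label on the error can be sketched as follows. The text does not name a specific loss function, so this sketch assumes a mean squared error over the unmasked amino acid classes; the function name and the mask convention (1 to include a class, 0 to skip it) are hypothetical.

```python
def masked_squared_error(scores, labels, mask):
    """Mean squared error over the amino acid classes where mask is 1.

    A masked class (mask 0) adds nothing to the error, so it would
    produce no gradient in a gradient-based update.
    """
    terms = [m * (s - y) ** 2 for s, y, m in zip(scores, labels, mask)]
    kept = sum(mask)
    return sum(terms) / kept if kept else 0.0
```

For instance, with scores (0.9, 0.2, 0.5), labels (1, 0, 1), and the first class masked, only the last two classes contribute to the error.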

Synonymous mutations are point mutations, meaning they are just a miscopied DNA nucleotide that only changes one base pair in the RNA copy of the DNA. A codon in RNA is a set of three nucleotides that encodes a specific amino acid. Most amino acids have several RNA codons that translate into that particular amino acid. Most of the time, if the third nucleotide is the one with the mutation, it will result in coding for the same amino acid. This is called a synonymous mutation because, like a synonym in grammar, the mutated codon has the same meaning as the original codon and therefore does not change the amino acid. If the amino acid does not change, then the protein is also unaffected. That means synonymous mutations have no real role in the evolution of species since the protein is not changed in any way. Synonymous mutations are actually fairly common, but since they have no effect, they are not noticed.

Nonsynonymous mutations have a much greater effect on an individual than a synonymous mutation. In a nonsynonymous mutation, there is usually an insertion or deletion of a single nucleotide in the sequence during transcription when the messenger RNA is copying the DNA. This single missing or added nucleotide causes a frameshift mutation which throws off the entire reading frame of the amino acid sequence and mixes up the codons. This usually does affect the amino acids that are coded for and changes the resulting protein that is expressed. The severity of this kind of mutation depends on how early in the amino acid sequence it happens. If it happens near the beginning and the entire protein is changed, this could become a lethal mutation. Another way a nonsynonymous mutation can occur is if the point mutation changes the single nucleotide into a codon that does not translate into the same amino acid. Often, the single amino acid change does not affect the protein very much and the protein is still viable. If it happens early in the sequence and the codon is changed to translate into a stop signal, then the protein will not be made, and it could cause serious consequences. Sometimes nonsynonymous mutations are actually positive changes. Natural selection may favor this new expression of the gene and the individual may have developed a favorable adaptation from the mutation. If that mutation occurs in the gametes, this adaptation will be passed down to the next generation of offspring. Nonsynonymous mutations increase the diversity in the gene pool for natural selection to work on and drive evolution on a microevolutionary level.

The nucleotide triplet that encodes an amino acid is called a codon. Each group of three nucleotides encodes one amino acid. Since there are 64 combinations of 4 nucleotides taken three at a time and only 20 amino acids, the code is degenerate (more than one codon per amino acid, in most cases). Because a single nucleotide polymorphism (SNP) changes only one of the three nucleotides in a codon, some alternate amino acid classes cannot be reached from a given reference codon. One example of the unreachable alternate amino acid classes is those alternate amino acid classes that are not coded by synonymous SNPs. Another example of the unreachable alternate amino acid classes is those alternate amino acid classes that are restricted by the number of triplet nucleotide mutant combinations deviated away by SNPs at the triplet nucleotide positions from an initial codon.

In one implementation, those unreachable alternate amino acid classes that are confined by reachability of SNPs to transform a reference codon of a reference amino acid into alternate amino acids of the unreachable alternate amino acid classes are masked in ground truth labels. In such an implementation, masked amino acid classes result in zero loss and do not contribute to gradient updates. In one implementation, the masked amino acid classes are identified in a lookup table. In one implementation, the lookup table identifies a set of masked amino acid classes for each reference amino acid position.
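One way to derive such a lookup is to enumerate every single-nucleotide substitution of a reference codon under the standard genetic code. The following is a minimal sketch under that assumption; the names `CODON_TABLE`, `reachable_classes`, and `masked_classes` are hypothetical, codons are written in the DNA alphabet, and stop codons are not counted as amino acid classes.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; "*" marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def reachable_classes(ref_codon):
    """Amino acid classes reachable from ref_codon by exactly one SNP."""
    reachable = set()
    for i in range(3):
        for base in BASES:
            if base == ref_codon[i]:
                continue
            amino = CODON_TABLE[ref_codon[:i] + base + ref_codon[i + 1:]]
            if amino != "*":
                reachable.add(amino)
    return reachable

def masked_classes(ref_codon):
    """Alternate classes that no single SNP can produce from ref_codon."""
    all_classes = set(AMINO) - {"*"}
    reference = CODON_TABLE[ref_codon]
    return all_classes - reachable_classes(ref_codon) - {reference}
```

For the methionine codon ATG, the nine possible SNPs reach only six alternate classes (L, V, T, K, R, I), so the remaining thirteen alternate classes would be masked under this scheme.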

The particular set of conserved amino acids and the particular set of non-conserved amino acids are identified based on evolutionary conservation profiles of homologous proteins of the plurality of species. In one implementation, the evolutionary conservation profiles of the homologous proteins are determined using a position-specific frequency matrix (PSFM). In another implementation, the evolutionary conservation profiles of the homologous proteins are determined using a position-specific scoring matrix (PSSM).

FIG. 58 depicts different implementations of ground truth label encodings used to train the evolutionary conservation determiner 5324. Ground truth label encoding 5802 uses evolutionary conservation frequencies (e.g., PSFM or PSSM) to label the conserved amino acid classes A, C, F, and uses a “zero value” to label the remaining non-conserved amino acid classes. Ground truth label encoding 5812 is the same as the ground truth label encoding 5802 except that the ground truth label encoding 5812 “masks out” the REF major allele/most-conserved amino acid class F such that the REF major allele/most-conserved amino acid class F does not contribute to the training of the evolutionary conservation determiner 5324 (e.g., by zeroing-out the loss calculated by the loss function for the REF major allele/most-conserved amino acid class F).

Ground truth label encoding 5822 uses a “one value” to label the conserved amino acid classes A, C, F, and uses a “zero value” to label the remaining non-conserved amino acid classes. Ground truth label encoding 5832 is the same as the ground truth label encoding 5822 except that the ground truth label encoding 5832 “masks out” the REF major allele/most-conserved amino acid class F such that the REF major allele/most-conserved amino acid class F does not contribute to the training of the evolutionary conservation determiner 5324 (e.g., by zeroing-out the loss calculated by the loss function for the REF major allele/most-conserved amino acid class F).
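The four label encodings of FIG. 58 can be sketched with a single helper. The function name, the amino acid ordering, and the representation of masking as a separate 0/1 vector are assumptions made for illustration; the frequency values in the example are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def encode_labels(frequencies, ref, binary=False, mask_ref=False):
    """Return (labels, mask) over the twenty amino acid classes.

    frequencies: conserved amino acid class -> conservation frequency.
    binary=False keeps frequencies (as in 5802/5812); binary=True uses
    ones for conserved classes (as in 5822/5832). mask_ref=True masks
    out the REF class (as in 5812/5832).
    """
    labels, mask = [], []
    for aa in AMINO_ACIDS:
        f = frequencies.get(aa, 0.0)  # non-conserved classes get zero
        labels.append((1.0 if f > 0 else 0.0) if binary else f)
        mask.append(0 if (mask_ref and aa == ref) else 1)
    return labels, mask

# Hypothetical frequencies for the conserved classes A, C, F.
labels, mask = encode_labels({"A": 0.2, "C": 0.1, "F": 0.7}, ref="F",
                             binary=True, mask_ref=True)
```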

FIG. 59 illustrates an example PSFM 5900. FIG. 60 depicts an example PSSM 6000. FIG. 61 shows one implementation of generating the PSFM and the PSSM. FIG. 62 illustrates an example PSFM 6200 encoding. FIG. 63 depicts an example PSSM 6300 encoding.

Multiple sequence alignment (MSA) is a sequence alignment of multiple homologous protein sequences to a target protein. MSA is an important step in comparative analyses and property prediction of biological sequences since a lot of information, for example, evolution and coevolution clusters, is generated from the MSA and can be mapped to the target sequence of choice or on the protein structure.

Sequence profiles of a protein sequence X of length L are an L×20 matrix, either in the form of a PSSM or a PSFM. The columns of a PSSM and a PSFM are indexed by the alphabet of amino acids and each row corresponds to a position in the protein sequence. PSSMs and PSFMs contain the substitution scores and the frequencies, respectively, of the amino acids at different positions in the protein sequence. Each row of a PSFM is normalized to sum to 1. The sequence profiles of the protein sequence X are computed by aligning X with multiple sequences in a protein database that have statistically significant sequence similarities with X. Therefore, the sequence profiles contain more general evolutionary and structural information of the protein family that protein sequence X belongs to, and thus, provide valuable information for remote homology detection and fold recognition.
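To make the L×20 layout concrete, the following sketch builds a PSFM from a toy alignment and derives a log-odds PSSM from it. It is a simplification and not the PSI-BLAST procedure: it assumes unweighted counts, a flat pseudocount, and a uniform background frequency of 1/20, and the function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def psfm_from_alignment(aligned, pseudocount=1.0):
    """Return an L x 20 frequency matrix; each row sums to 1."""
    rows = []
    for pos in range(len(aligned[0])):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in aligned:
            if seq[pos] in counts:  # skip gaps and unknown symbols
                counts[seq[pos]] += 1.0
        total = sum(counts.values())
        rows.append([counts[aa] / total for aa in AMINO_ACIDS])
    return rows

def pssm_from_psfm(psfm, background=1.0 / 20):
    """Log-odds score of each frequency against a uniform background."""
    return [[math.log2(f / background) for f in row] for row in psfm]

psfm = psfm_from_alignment(["MK", "ML", "MK"])
pssm = pssm_from_psfm(psfm)
```

In this toy alignment, M is observed in every sequence at the first position, so its PSSM score there is positive, while unobserved amino acids score below zero.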

A protein sequence (called query sequence, e.g., a reference amino acid sequence of a protein) can be used as a seed to search and align homologous sequences from a protein database (e.g., SWISSPROT) using, for example, a PSI-BLAST program. The aligned sequences share some homologous segments and belong to the same protein family. The aligned sequences are further converted into two profiles to express their homology information: PSSM and PSFM. In this representation, both PSSM and PSFM are matrices with 20 rows and L columns (the transpose of the L×20 layout described above), where L is the total number of amino acids in the query sequence. Each column of a PSSM represents the log-likelihood of the residue substitutions at the corresponding positions in the query sequence. The (i, j)-th entry of the PSSM matrix represents the chance of the amino acid in the j-th position of the query sequence being mutated to amino acid type i during the evolution process. A PSFM contains the weighted observation frequencies of each position of the aligned sequences. Specifically, the (i, j)-th entry of the PSFM matrix represents the possibility of having amino acid type i in position j of the query sequence.

Given a query sequence, we first obtain its sequence profile by presenting it to PSI-BLAST to search and align homologous protein sequences from a protein database (e.g., Swiss-Prot Database). FIG. 61 shows the procedures of obtaining the sequence profile by using the PSI-BLAST program. The parameters h and j for PSI-BLAST are usually set to 0.001 and 3, respectively. The sequence profile of a protein encapsulates its homolog information pertaining to a query protein sequence. In PSI-BLAST, the homolog information is represented by two matrices: the PSFM and the PSSM. Examples of the PSFM and the PSSM are shown in FIGS. 62 and 63, respectively.

In FIG. 62, the (l, u)-th element (l ∈ {1, 2, ..., L_i}, u ∈ {1, 2, ..., 20}) represents the chance of having the u-th amino acid in the l-th position of the query protein. For example, the chance of having the amino acid M in the 1st position of the query protein is 0.36.

In FIG. 63, the (l, u)-th element (l ∈ {1, 2, ..., L_i}, u ∈ {1, 2, ..., 20}) represents the likelihood score of the amino acid in the l-th position of the query protein being mutated to the u-th amino acid during the evolution process. For example, the score for the amino acid V in the 1st position of the query protein being mutated to H during the evolution process is −3, while that in the 8th position is −4.

Combined Learning and Transfer Learning

FIG. 64 illustrates two datasets on which the models disclosed herein can be trained, for example, by way of combined learning (FIGS. 65A-B), or by way of transfer learning (FIGS. 66A-B). The first training dataset is called JigsawAI dataset 6406. The second training dataset is called PrimateAI dataset 6408. The JigsawAI dataset 6406 is characterized by a voxel input 6412 with a missing central residue identified as a gap amino acid, as discussed above. The PrimateAI dataset 6408 is characterized by the voxel input 6412 with no missing residues and complete input.

For the JigsawAI dataset 6406, ground truth labels 6422 have a missing or masked label 6426 for the gap amino acid (e.g., the REF amino acid). For the PrimateAI dataset 6408, the ground truth labels 6422 have nineteen missing or masked labels 6436 for those remaining amino acids that are different from the alternate amino acid-under-analysis (benign or pathogenic). In one implementation, the number of samples 6432 is 10 million 6436 in the JigsawAI dataset 6406, and 1 million 6438 in the PrimateAI dataset 6408.

FIGS. 65A-B illustrate one implementation of combined learning 6500 of the models disclosed herein. At action 6502, a gapped training set is accessed. The gapped training set is also referred to herein as the JigsawAI dataset 6406. The gapped training set includes respective gapped protein samples for respective positions in a proteome. The respective gapped protein samples are labelled with respective gapped ground truth sequences. A particular gapped ground truth sequence for a particular gapped protein sample has a benign label for a particular amino acid class that corresponds to a reference amino acid at a particular position in the particular gapped protein, and has respective pathogenic labels for respective remaining amino acid classes that correspond to alternate amino acids at the particular position.

At action 6512, a non-gapped training set is accessed. The non-gapped training set is also referred to herein as the PrimateAI dataset 6408. The non-gapped training set includes non-gapped benign protein samples and non-gapped pathogenic protein samples. A particular non-gapped benign protein sample includes a benign alternate amino acid at a particular position substituted by a benign nucleotide variant. A particular non-gapped pathogenic protein sample includes a pathogenic alternate amino acid at a particular position substituted by a pathogenic nucleotide variant. The particular non-gapped benign protein sample is labelled with a benign ground truth sequence that has a benign label for a particular amino acid class that corresponds to the benign alternate amino acid, and respective masked labels for respective remaining amino acid classes that correspond to amino acids that are different from the benign alternate amino acid. The particular non-gapped pathogenic protein sample is labelled with a pathogenic ground truth sequence that has a pathogenic label for a particular amino acid class that corresponds to the pathogenic alternate amino acid, and respective masked labels for respective remaining amino acid classes that correspond to amino acids that are different from the pathogenic alternate amino acid.

In one implementation, the benign label for the particular amino acid class that corresponds to the reference amino acid at the particular position in the particular gapped protein is masked. In one implementation, the non-gapped benign protein samples are derived from common human and non-human primate nucleotide variants. In one implementation, the non-gapped pathogenic protein samples are derived from combinatorically simulated nucleotide variants.
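The two kinds of ground truth sequences described above can be sketched as label and mask vectors over the twenty amino acid classes. The numeric encoding (0 for benign, 1 for pathogenic), the separate 0/1 mask, and the function names are assumptions made for illustration; the text does not fix a numeric encoding.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
BENIGN, PATHOGENIC = 0.0, 1.0  # assumed numeric encoding

def gapped_labels(ref_aa, mask_ref=False):
    """Gapped sample: REF class benign, the nineteen others pathogenic."""
    labels = [BENIGN if aa == ref_aa else PATHOGENIC for aa in AMINO_ACIDS]
    mask = [0 if (mask_ref and aa == ref_aa) else 1 for aa in AMINO_ACIDS]
    return labels, mask

def non_gapped_labels(alt_aa, is_pathogenic):
    """Non-gapped sample: only the ALT class is labelled; the rest are masked."""
    label = PATHOGENIC if is_pathogenic else BENIGN
    labels = [label if aa == alt_aa else 0.0 for aa in AMINO_ACIDS]
    mask = [1 if aa == alt_aa else 0 for aa in AMINO_ACIDS]
    return labels, mask
```

A gapped sample therefore supplies up to twenty labelled classes per position, while a non-gapped sample supplies exactly one.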

At action 6522, respective gapped spatial representations for the gappedprotein samples are generated, and respective non-gapped spatialrepresentations for the non-gapped benign protein samples and thenon-gapped pathogenic protein samples are generated.

At action 6532, the pathogenicity classifier 2108/2600/2700 is trained over one or more training cycles, and a trained pathogenicity classifier 2108/2600/2700 is generated as a result of parameters/coefficients/weights of the trained pathogenicity classifier 2108/2600/2700 being optimized. Each of the training cycles uses as training examples gapped spatial representations from the respective gapped spatial representations, and non-gapped spatial representations from the respective non-gapped spatial representations.

At action 6542, the trained pathogenicity classifier 2108/2600/2700 is used to determine pathogenicity of variants.

In one implementation, a sample indicator is used to indicate to the pathogenicity classifier 2108/2600/2700 whether a current training example is a gapped spatial representation for a gapped protein sample, or a non-gapped spatial representation for a non-gapped protein sample.

In one implementation, the pathogenicity classifier 2108/2600/2700 generates an amino acid class-wise output sequence in response to processing a training example. The amino acid class-wise output sequence has amino acid class-wise pathogenicity scores.

In one implementation, a performance of the trained pathogenicity classifier 2108/2600/2700 is measured between training cycles over a validation set. In some implementations, the validation set includes a pair of gapped and non-gapped spatial representations for each held-out protein sample.

In one implementation, the trained pathogenicity classifier 2108/2600/2700 generates a first amino acid class-wise output sequence for the gapped spatial representation in the pair, and a second amino acid class-wise output sequence for the non-gapped spatial representation in the pair. In some implementations, a final pathogenicity score for a nucleotide variant that causes an amino acid substitution in a held-out protein sample is determined based on a combination of first and second pathogenicity scores for the amino acid substitution in the first and second amino acid class-wise output sequences. In other implementations, the final pathogenicity score is based on an average of the first and second pathogenicity scores.
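The averaging variant can be illustrated with a short sketch. The function name `final_pathogenicity_score` and the representation of each output sequence as a plain list indexed by amino acid class are assumptions made for illustration only.

```python
# Hypothetical sketch: the function name and list-based outputs are assumptions.
def final_pathogenicity_score(first_scores, second_scores, class_index):
    """Average the score from the gapped output sequence (first_scores) and
    the non-gapped output sequence (second_scores) for one amino acid class."""
    return 0.5 * (first_scores[class_index] + second_scores[class_index])


# Toy two-class output sequences; the substitution of interest is class 1.
first_output = [0.10, 0.25]   # from the gapped spatial representation
second_output = [0.30, 0.75]  # from the non-gapped spatial representation
score = final_pathogenicity_score(first_output, second_output, class_index=1)
```

Other combinations of the two scores fit the same interface by swapping out the arithmetic in the return statement.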

In some implementations, at least some of the training cycles use a same number of gapped spatial representations and non-gapped spatial representations. In other implementations, at least some of the training cycles use batches of training examples that have a same number of gapped spatial representations and non-gapped spatial representations.
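A minimal sketch of batches holding the same number of gapped and non-gapped examples is shown below. The function name `balanced_batches`, the shuffling strategy, and the encoding of the sample indicator as 1 (gapped) or 0 (non-gapped) are assumptions, not details taken from the disclosure.

```python
# Hypothetical sketch: batching strategy and indicator encoding are assumptions.
import random


def balanced_batches(gapped, non_gapped, batch_size, seed=0):
    """Yield batches with batch_size // 2 gapped and batch_size // 2
    non-gapped examples. Each example is paired with a sample indicator:
    1 for a gapped representation, 0 for a non-gapped one."""
    if batch_size % 2:
        raise ValueError("batch_size must be even to balance the two sets")
    half = batch_size // 2
    rng = random.Random(seed)
    gapped, non_gapped = list(gapped), list(non_gapped)
    rng.shuffle(gapped)
    rng.shuffle(non_gapped)
    # The smaller set bounds how many balanced batches can be formed.
    n_batches = min(len(gapped), len(non_gapped)) // half
    for b in range(n_batches):
        batch = [(x, 1) for x in gapped[b * half:(b + 1) * half]]
        batch += [(x, 0) for x in non_gapped[b * half:(b + 1) * half]]
        rng.shuffle(batch)  # interleave the two kinds within the batch
        yield batch


batches = list(balanced_batches(range(10), range(100, 106), batch_size=4))
```

In this sketch, surplus examples from the larger set are simply left unused in a given cycle; resampling the smaller set would be an alternative design.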

In one implementation, a masked label does not contribute to error determination, and therefore does not contribute to training of the pathogenicity classifier 2108/2600/2700. In some implementations, the masked label is zeroed-out.
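One way to keep masked labels out of the error determination is to skip them when accumulating the loss. The sketch below assumes a binary cross-entropy loss and a `None` sentinel for masked labels; neither choice is stated in the text, and the function name `masked_bce` is hypothetical.

```python
# Hypothetical sketch: the loss function and mask sentinel are assumptions.
import math


def masked_bce(scores, labels, eps=1e-7):
    """Mean binary cross-entropy over unmasked classes only.
    A label of None is masked and contributes nothing to the error."""
    total, count = 0.0, 0
    for s, y in zip(scores, labels):
        if y is None:
            continue  # masked class: no error term, hence no gradient
        s = min(max(s, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(s) + (1.0 - y) * math.log(1.0 - s)
        count += 1
    return total / count if count else 0.0


# The score at the masked class can change without affecting the loss.
loss_a = masked_bce([0.5, 0.9], [1.0, None])
loss_b = masked_bce([0.5, 0.1], [1.0, None])
```

Because the masked term is never added to the loss, its derivative with respect to the classifier parameters is zero, which matches the statement that a masked label does not contribute to training.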

In some implementations, the gapped spatial representations are weighted differently from the non-gapped spatial representations, such that a contribution of the gapped spatial representations to gradient updates applied to parameters of the pathogenicity classifier 2108/2600/2700 in response to the pathogenicity classifier 2108/2600/2700 processing the gapped spatial representations varies from a contribution of the non-gapped spatial representations to gradient updates applied to the parameters of the pathogenicity classifier 2108/2600/2700 in response to the pathogenicity classifier 2108/2600/2700 processing the non-gapped spatial representations. In one implementation, the variation is determined by pre-defined weights.
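Because a gradient is linear in the loss, scaling each example's loss by a pre-defined weight scales its contribution to the gradient update by the same factor. The sketch below illustrates this with a weighted batch loss; the function name `weighted_batch_loss`, the default weight values, and the normalization by the sum of weights are all assumptions.

```python
# Hypothetical sketch: weight values and normalization are assumptions.
def weighted_batch_loss(losses, indicators, gapped_weight=0.5, non_gapped_weight=1.0):
    """Combine per-example losses into one batch loss.
    indicators[i] is 1 for a gapped example and 0 for a non-gapped example.
    Each example's gradient contribution is proportional to its weight."""
    weights = [gapped_weight if g else non_gapped_weight for g in indicators]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)


# One gapped example (loss 2.0) and one non-gapped example (loss 4.0).
down_weighted = weighted_batch_loss([2.0, 4.0], [1, 0])
equal = weighted_batch_loss([2.0, 4.0], [1, 0], gapped_weight=1.0)
```

With equal weights the result reduces to the ordinary mean of the per-example losses.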

FIGS. 66A-B illustrate one implementation of using transfer learning 6600 to train the models disclosed herein using the two datasets shown in FIG. 64. At action 6602, the pathogenicity classifier 2108/2600/2700 is first trained on the gapped training set (i.e., the JigsawAI data set 6406) to generate the trained pathogenicity classifier 2108/2600/2700.

At action 6612, the trained pathogenicity classifier 2108/2600/2700 is further trained on the non-gapped training set (i.e., the PrimateAI dataset 6408) to generate a retrained pathogenicity classifier 2108/2600/2700.

At action 6622, the retrained pathogenicity classifier 2108/2600/2700 is used to determine pathogenicity of variants.
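The ordering of actions 6602 and 6612 can be sketched as two sequential training phases that share one set of parameters. The helper names `train` and `transfer_learning`, the `step_fn` callback, and the cycle counts are hypothetical; a real update step would apply a gradient update rather than the toy bookkeeping used in the usage lines.

```python
# Hypothetical sketch: helper names and the step_fn interface are assumptions.
def train(state, dataset, step_fn, cycles):
    """Run `cycles` passes over `dataset`, threading the model state through
    step_fn(state, example) -> new_state."""
    for _ in range(cycles):
        for example in dataset:
            state = step_fn(state, example)
    return state


def transfer_learning(initial_state, gapped_set, non_gapped_set, step_fn,
                      pretrain_cycles=1, finetune_cycles=1):
    """Train first on the gapped set, then continue training the same
    state on the non-gapped set."""
    trained = train(initial_state, gapped_set, step_fn, pretrain_cycles)
    retrained = train(trained, non_gapped_set, step_fn, finetune_cycles)
    return trained, retrained


# Toy step function that only records which examples were seen, in order.
trained, retrained = transfer_learning(
    (), ["gapped_1", "gapped_2"], ["non_gapped_1"], lambda s, e: s + (e,))
```

The retrained state starts from the trained state rather than from the initial state, which is the point of the transfer.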

At action 6632, performance of the trained pathogenicity classifier 2108/2600/2700 is measured between training cycles over a first validation set that includes only non-gapped spatial representations of held-out protein samples. In another implementation, performance of the retrained pathogenicity classifier 2108/2600/2700 is measured between training cycles over a second validation set that includes gapped spatial representations and non-gapped spatial representations of held-out protein samples.

At action 6642, the retrained pathogenicity classifier 2108/2600/2700 generates a first amino acid class-wise output sequence for the pair in response to processing the pair. In one implementation, a final pathogenicity score for a nucleotide variant that causes an amino acid substitution in a corresponding held-out protein sample is determined based on the first amino acid class-wise output sequence.

Generating Training Data and Training Labels

FIG. 67 shows one implementation of generating 6700 training data and labels to train the models disclosed herein.

A proteome accessor 6704 accesses a multitude of amino acid positions in a proteome with a plurality of proteins.

A reference specifier 6714 specifies major allele amino acids at the multitude of amino acid positions as reference amino acids of the plurality of proteins.

A benign labeler 6724, for each amino acid position in the multitude of amino acid positions, classifies those nucleotide substitutions as benign variants that substitute a particular reference amino acid with the particular reference amino acid at a particular amino acid position in a particular alternate representation of a particular protein.

A pathogenic labeler 6734, for each amino acid position in the multitude of amino acid positions, classifies those nucleotide substitutions as pathogenic variants that substitute the particular reference amino acid with alternate amino acids at the particular amino acid position. The alternate amino acids are different from the particular reference amino acid.

A trainer 6744 trains a variant pathogenicity classifier 2108/2600/2700 on training data comprising spatial representations of protein samples, such that the spatial representations are assigned ground truth benign labels that correspond to the benign variants, and ground truth pathogenic labels that correspond to the pathogenic variants.
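The labeling performed by the benign labeler 6724 and the pathogenic labeler 6734 can be sketched as follows. The function name `label_substitutions`, the tuple layout, and the use of a short amino acid string in place of a proteome are assumptions for illustration.

```python
# Hypothetical sketch: function name and tuple layout are assumptions.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # twenty amino acid classes


def label_substitutions(reference_sequence):
    """For every position, emit (position, reference, alternate, label).
    Substituting the reference amino acid with itself is labeled benign;
    substituting it with any of the nineteen other classes is labeled
    pathogenic."""
    samples = []
    for position, reference in enumerate(reference_sequence):
        for alternate in AMINO_ACIDS:
            label = "benign" if alternate == reference else "pathogenic"
            samples.append((position, reference, alternate, label))
    return samples


samples = label_substitutions("MKV")  # toy three-position reference sequence
n_benign = sum(1 for s in samples if s[3] == "benign")
n_pathogenic = sum(1 for s in samples if s[3] == "pathogenic")
```

Each position yields one benign and nineteen pathogenic samples, which is the same ratio as the ten million benign and one hundred and ninety million pathogenic samples recited in the clauses.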

In one implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to determine whether a substitution of a first amino acid with a second amino acid at a given amino acid position in a protein is pathogenic or benign. In such an implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to generate a pathogenicity score for the substitution. In one implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to determine whether respective substitutions of a first amino acid with respective amino acids at a given amino acid position in a protein are pathogenic or benign. In such an implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to generate respective pathogenicity scores for the respective substitutions. In some implementations, the respective amino acids correspond to respective twenty naturally-occurring amino acids. In other implementations, the respective amino acids correspond to respective naturally-occurring amino acids from a subset of the twenty naturally-occurring amino acids.

In one implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to determine whether an insertion of an amino acid at a given vacant amino acid position in a protein is pathogenic or benign. In such an implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to generate a pathogenicity score for the insertion. In one implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to determine whether respective insertions of respective amino acids at a given vacant amino acid position in a protein are pathogenic or benign. In such an implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to generate respective pathogenicity scores for the respective insertions. In some implementations, the respective amino acids correspond to respective twenty naturally-occurring amino acids. In other implementations, the respective amino acids correspond to respective naturally-occurring amino acids from a subset of the twenty naturally-occurring amino acids.

In one implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to determine whether a substitution of a first amino acid with a second amino acid at a given amino acid position in a protein is spatially tolerated by other amino acids of the protein or not. In such an implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to generate a spatial tolerance score for the substitution. In one implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to determine whether respective substitutions of a first amino acid with respective amino acids at a given amino acid position in a protein are spatially tolerated by other amino acids of the protein or not. In such an implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to generate respective spatial tolerance scores for the respective substitutions. In some implementations, the respective amino acids correspond to respective twenty naturally-occurring amino acids. In other implementations, the respective amino acids correspond to respective naturally-occurring amino acids from a subset of the twenty naturally-occurring amino acids.

In one implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to determine whether an insertion of an amino acid at a given vacant amino acid position in a protein is spatially tolerated by other amino acids of the protein or not. In such an implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to generate a spatial tolerance score for the insertion. In one implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to determine whether respective insertions of respective amino acids at a given vacant amino acid position in a protein are spatially tolerated by other amino acids of the protein or not. In such an implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to generate respective spatial tolerance scores for the respective insertions. In some implementations, the respective amino acids correspond to respective twenty naturally-occurring amino acids. In other implementations, the respective amino acids correspond to respective naturally-occurring amino acids from a subset of the twenty naturally-occurring amino acids.

In one implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to determine whether a substitution of a first amino acid with a second amino acid at a given amino acid position in a protein is evolutionarily conserved or non-conserved. In such an implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to generate an evolutionary conservation score for the substitution. In one implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to determine whether respective substitutions of a first amino acid with respective amino acids at a given amino acid position in a protein are evolutionarily conserved or non-conserved. In such an implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to generate respective evolutionary conservation scores for the respective substitutions. In some implementations, the respective amino acids correspond to respective twenty naturally-occurring amino acids. In other implementations, the respective amino acids correspond to respective naturally-occurring amino acids from a subset of the twenty naturally-occurring amino acids.

In one implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to determine whether an insertion of an amino acid at a given vacant amino acid position in a protein is evolutionarily conserved or non-conserved. In such an implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to generate an evolutionary conservation score for the insertion.

In one implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to determine whether respective insertions of respective amino acids at a given vacant amino acid position in a protein are evolutionarily conserved or non-conserved. In such an implementation, the variant pathogenicity classifier 2108/2600/2700 is trained to generate respective evolutionary conservation scores for the respective insertions. In some implementations, the respective amino acids correspond to respective twenty naturally-occurring amino acids. In other implementations, the respective amino acids correspond to respective naturally-occurring amino acids from a subset of the twenty naturally-occurring amino acids.

In different implementations, spatial tolerance corresponds to structural tolerance, and spatial intolerance corresponds to structural intolerance. In different implementations, the multitude of amino acid positions ranges from one million to ten million amino acid positions. In different implementations, the multitude of amino acid positions ranges from ten million to one hundred million amino acid positions. In different implementations, the multitude of amino acid positions ranges from one hundred million to a billion amino acid positions. In different implementations, the multitude of amino acid positions ranges from one to a million amino acid positions.

In one implementation, alternate amino acid classes that are unreachable, i.e., classes whose amino acids cannot be produced from a reference codon of a reference amino acid by a single nucleotide polymorphism (SNP), are masked in ground truth labels. In such an implementation, masked amino acid classes result in zero loss and do not contribute to gradient updates. In such an implementation, the masked amino acid classes are identified in a lookup table. In such an implementation, the lookup table identifies a set of masked amino acid classes for each reference amino acid position.
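A lookup table of this kind can be derived from the standard genetic code, as in the sketch below. Keying the table by reference codon (rather than by proteome position), excluding stop codons, and leaving the reference amino acid itself out of the masked set are assumptions made here; the names `reachable_amino_acids` and `MASK_LOOKUP` are hypothetical.

```python
# Hypothetical sketch: keying by codon and the handling of stop codons are assumptions.
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; "*" marks a stop codon.
AA_BY_CODON = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AA_BY_CODON)}
AMINO_ACIDS = sorted(set(AA_BY_CODON) - {"*"})  # twenty amino acid classes


def reachable_amino_acids(codon):
    """Amino acid classes, other than the reference, that a single
    nucleotide change to `codon` can produce (stop codons excluded)."""
    reference = CODON_TABLE[codon]
    reachable = set()
    for i, base in enumerate(codon):
        for alt in BASES:
            if alt == base:
                continue
            aa = CODON_TABLE[codon[:i] + alt + codon[i + 1:]]
            if aa != "*" and aa != reference:
                reachable.add(aa)
    return reachable


# For each sense codon: the alternate classes to mask in the ground truth.
MASK_LOOKUP = {
    codon: set(AMINO_ACIDS) - reachable_amino_acids(codon) - {aa}
    for codon, aa in CODON_TABLE.items() if aa != "*"
}
```

For the methionine codon ATG, for example, the nine possible single-nucleotide changes reach only six other amino acid classes, so the remaining thirteen alternate classes would be masked for a position encoded by that codon.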

In different implementations, the spatial representations are structural representations of protein structures of the protein samples. In different implementations, the spatial representations are encoded using voxelization.

Pathogenicity Determination

FIG. 68 illustrates one implementation of a method 6800 of determining pathogenicity of nucleotide variants. The method includes, at action 6802, accessing a spatial representation of a protein. The spatial representation of the protein specifies respective spatial configurations of respective amino acids at respective positions in the protein.

The method includes, at action 6812, removing, from the spatial representation of the protein, a particular spatial configuration of a particular amino acid at a particular position, thereby generating a gapped spatial representation of the protein. In one implementation, the removal of the particular spatial configuration is implemented (or automated) by a script.

The method includes, at action 6822, determining a pathogenicity of a nucleotide variant based at least in part on the gapped spatial representation, and a representation of an alternate amino acid created by the nucleotide variant at the particular position.
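The removal at action 6812 can be illustrated with amino acid class-wise distance channels of the kind recited in the clauses, where the gap is created by disregarding atoms of the amino acid at the particular position. This is a simplified sketch: the function name `distance_channels`, the atom tuple layout, the use of infinity for classes with no remaining atoms, and the absence of any normalization are assumptions.

```python
# Hypothetical sketch: data layout and the infinity default are assumptions.
import math


def distance_channels(atoms, voxel_centers, classes, gap_position=None):
    """Return {amino acid class: [distance per voxel]}, where each value is
    the distance from the voxel center to the nearest atom of that class.
    Atoms of the amino acid at `gap_position` are disregarded, which yields
    a gapped spatial representation. `atoms` holds tuples of
    (sequence position, amino acid class, (x, y, z))."""
    channels = {aa: [math.inf] * len(voxel_centers) for aa in classes}
    for position, aa, xyz in atoms:
        if position == gap_position:
            continue  # exclude the spatial configuration of the gap amino acid
        for v, center in enumerate(voxel_centers):
            d = math.dist(center, xyz)
            if d < channels[aa][v]:
                channels[aa][v] = d
    return channels


# Toy protein: alanine at position 0, glycine at position 1, alanine at position 2.
atoms = [(0, "A", (0.0, 0.0, 0.0)), (1, "G", (1.0, 0.0, 0.0)), (2, "A", (3.0, 0.0, 0.0))]
voxels = [(0.0, 0.0, 0.0)]
non_gapped = distance_channels(atoms, voxels, classes="AG")
gapped = distance_channels(atoms, voxels, classes="AG", gap_position=0)
```

In the toy example, gapping position 0 moves the nearest alanine atom for the voxel from distance 0 to distance 3, while the glycine channel is unchanged.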

Structural Tolerability Prediction

FIG. 69 illustrates one implementation of a system 6900 to predict structural tolerability of amino acid substitutes. At action 6902, a gapping logic is configured to remove, from a spatial representation of a protein, a particular amino acid at a particular position, and create an amino acid vacancy at the particular position in the spatial representation of the protein.

At action 6912, a structural tolerability prediction logic is configured to process the spatial representation of the protein with the amino acid vacancy, and rank structural tolerability of substitute amino acids that are candidates for filling the amino acid vacancy based on amino acid co-occurrence patterns in a neighborhood of the amino acid vacancy.

Performance Results as Objective Indicia of Inventiveness and Non-Obviousness

The variant pathogenicity classifier disclosed herein makes pathogenicity predictions based on 3D protein structures and is referred to as “PrimateAI 3D.” “PrimateAI” is a commonly owned and previously disclosed variant pathogenicity classifier that makes pathogenicity predictions based on protein sequences. Additional details about PrimateAI can be found in commonly owned U.S. patent application Ser. Nos. 16/160,903; 16/160,986; 16/160,968; and 16/407,149 and in Sundaram, L. et al. Predicting the clinical impact of human mutation with deep neural networks. Nat. Genet. 50, 1161-1170 (2018).

The variant pathogenicity classifier trained using the transfer learning technique disclosed herein (FIGS. 66A-B) is referred to as “Transfer Learning.” The variant pathogenicity classifier trained using the combined learning technique disclosed herein (FIGS. 65A-B) is referred to as “Combined Learning.”

The performance results in FIG. 70 are generated on the classification task of accurately distinguishing benign variants from pathogenic variants across a plurality of validation sets. New developmental delay disorder (new DDD) is one example of a validation set used to compare the classification accuracy of Transfer Learning against Combined Learning against PrimateAI 3D against PrimateAI. The new DDD validation set labels variants from individuals with DDD as pathogenic and labels the same variants from healthy relatives of the individuals with the DDD as benign. A similar labelling scheme is used with an autism spectrum disorder (ASD) validation set.

BRCA1 is another example of a validation set used to compare the classification accuracy of Transfer Learning against Combined Learning against PrimateAI 3D against PrimateAI. The BRCA1 validation set labels synthetically generated reference amino acid sequences simulating proteins of the BRCA1 gene as benign variants and labels synthetically altered allele amino acid sequences simulating proteins of the BRCA1 gene as pathogenic variants. A similar labelling scheme is used with different validation sets of the TP53 gene, TP53S3 gene and its variants, and other genes and their variants shown in FIG. 70.

In FIG. 70, the y-axis has p-values, and the x-axis has the different validation sets. As demonstrated by the p-values in FIG. 70, Combined Learning generally outperforms other approaches, followed by Transfer Learning, which is in turn followed by PrimateAI 3D. Greater p-values, i.e., longer vertical bars, denote greater accuracy in differentiating benign variants from pathogenic variants. In FIG. 70, the vertical bars for Combined Learning are consistently longer than the vertical bars for other approaches.

Also, in FIG. 70, a separate “mean” chart calculates the mean of the p-values determined for each of the validation sets. In the mean chart as well, Combined Learning generally outperforms other approaches, followed by Transfer Learning, which is in turn followed by PrimateAI 3D, as indicated by the horizontal bars for Combined Learning being consistently longer than the horizontal bars for other approaches.

The mean statistics may be biased by outliers. To address this, a separate “method ranks” chart is also depicted in FIG. 70. Higher rank denotes poorer classification accuracy. In the method ranks chart as well, Combined Learning generally outperforms other approaches, followed by Transfer Learning, which is in turn followed by PrimateAI 3D. In the method ranks chart, having more counts of the lower ranks 1 and 2 is better than having more counts of the higher rank 3.

Clauses

The technology disclosed can be practiced as a system, method, or article of manufacture. One or more features of an implementation can be combined with the base implementation. Implementations that are not mutually exclusive are taught to be combinable. One or more features of an implementation can be combined with other implementations. This disclosure periodically reminds the user of these options. Omission from some implementations of recitations that repeat these options should not be taken as limiting the combinations taught in the preceding sections; these recitations are hereby incorporated forward by reference into each of the following implementations.

One or more implementations and clauses of the technology disclosed or elements thereof can be implemented in the form of a computer product, including a non-transitory computer readable storage medium with computer usable program code for performing the method steps indicated. Furthermore, one or more implementations and clauses of the technology disclosed or elements thereof can be implemented in the form of an apparatus including a memory and at least one processor that is coupled to the memory and operative to perform exemplary method steps. Yet further, in another aspect, one or more implementations and clauses of the technology disclosed or elements thereof can be implemented in the form of means for carrying out one or more of the method steps described herein; the means can include (i) hardware module(s), (ii) software module(s) executing on one or more hardware processors, or (iii) a combination of hardware and software modules; any of (i)-(iii) implement the specific techniques set forth herein, and the software modules are stored in a computer readable storage medium (or multiple such media).

The clauses described in this section can be combined as features. In the interest of conciseness, the combinations of features are not individually enumerated and are not repeated with each base set of features. The reader will understand how features identified in the clauses described in this section can readily be combined with sets of base features identified as implementations in other sections of this application. These clauses are not meant to be mutually exclusive, exhaustive, or restrictive; and the technology disclosed is not limited to these clauses but rather encompasses all possible combinations, modifications, and variations within the scope of the claimed technology and its equivalents.

Other implementations of the clauses described in this section can include a non-transitory computer readable storage medium storing instructions executable by a processor to perform any of the clauses described in this section. Yet another implementation of the clauses described in this section can include a system including memory and one or more processors operable to execute instructions, stored in the memory, to perform any of the clauses described in this section.

We disclose the following clauses:

Clauses Set 1

-   1. A computer-implemented method of determining pathogenicity of nucleotide variants, including:
-   accessing a protein that has respective amino acids at respective positions;
-   specifying a particular amino acid at a particular position in the protein as a gap amino acid, and specifying remaining amino acids at remaining positions in the protein as non-gap amino acids;
-   generating a gapped spatial representation of the protein that
    -   includes spatial configurations of the non-gap amino acids, and
    -   excludes a spatial configuration of the gap amino acid; and
-   determining a pathogenicity of a nucleotide variant based at least in part on the gapped spatial representation, and
    -   a representation of an alternate amino acid created by the nucleotide variant at the particular position.
-   2. The computer-implemented method of clause 1, wherein the spatial configurations of the non-gap amino acids are encoded as amino acid class-wise distance channels,
-   wherein each of the amino acid class-wise distance channels has voxel-wise distance values for voxels in a plurality of voxels, and
-   wherein the voxel-wise distance values specify distances from corresponding voxels in the plurality of voxels to atoms of the non-gap amino acids.
-   3. The computer-implemented method of clause 2, wherein the spatial configurations of the non-gap amino acids are determined based on spatial proximity between the corresponding voxels and the atoms of the non-gap amino acids.
-   4. The computer-implemented method of clause 2, wherein the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding distances from the corresponding voxels to atoms of the gap amino acid when determining the voxel-wise distance values.
-   5. The computer-implemented method of clause 4, wherein the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding spatial proximity between the corresponding voxels and the atoms of the gap amino acid.
-   6. The computer-implemented method of clause 1, wherein the particular amino acid is a reference amino acid that is a major allele of the protein.
-   7. The computer-implemented method of clause 1, wherein a pathogenicity predictor determines the pathogenicity of the nucleotide variant by
-   processing, as input,
    -   the gapped spatial representation, and
    -   the representation of the alternate amino acid; and
-   generating, as output, a pathogenicity score for the alternate amino acid.
-   8. The computer-implemented method of clause 7, wherein the pathogenicity predictor is trained on a benign training set.
-   9. The computer-implemented method of clause 8, wherein the benign training set has respective benign protein samples for respective reference amino acids at respective positions in a proteome.
-   10. The computer-implemented method of clause 9, wherein the reference amino acids are major allele amino acids of the proteome.
-   11. The computer-implemented method of clause 10, wherein the proteome has ten million positions, and therefore the benign training set has ten million benign protein samples.
-   12. The computer-implemented method of clause 11, wherein the respective benign protein samples have respective gapped spatial representations generated by using the respective reference amino acids as respective gap amino acids.
-   13. The computer-implemented method of clause 12, wherein the respective benign protein samples have respective representations of the respective reference amino acids as respective alternate amino acids.
-   14. The computer-implemented method of clause 13, wherein the pathogenicity predictor trains on a particular benign protein sample and estimates a pathogenicity of a particular reference amino acid at a particular position in the particular benign protein sample by
-   processing, as input,
    -   (i) a particular gapped spatial representation of the particular benign protein sample,
        -   wherein the particular gapped spatial representation is generated
            -   by using the particular reference amino acid as a gap amino acid, and
            -   by using remaining amino acids at remaining positions in the particular benign protein sample as non-gap amino acids, and
    -   (ii) a representation of the particular reference amino acid as a particular alternate amino acid; and
-   generating, as output, a pathogenicity score for the particular reference amino acid.
-   15. The computer-implemented method of clause 14, wherein each of the benign protein samples has a ground truth benignness label that indicates absolute benignness of the benign protein samples.
-   16. The computer-implemented method of clause 15, wherein the ground truth benignness label is zero.
-   17. The computer-implemented method of clause 16, wherein the pathogenicity score for the particular reference amino acid is compared against the ground truth benignness label to determine an error, and to improve coefficients of the pathogenicity predictor based on the error using a training technique.
-   18. The computer-implemented method of clause 1, wherein the pathogenicity predictor is trained on a pathogenic training set.
-   19. The computer-implemented method of clause 18, wherein the pathogenic training set has respective pathogenic protein samples for respective combinatorically generated amino acid substitutions for each of the reference amino acids at each of the respective positions in the proteome.
-   20. The computer-implemented method of clause 19, wherein the combinatorically generated amino acid substitutions for a particular reference amino acid of a particular amino acid class at a particular position in the proteome include respective alternate amino acids of respective amino acid classes that are different from the particular amino acid class.
-   21. The computer-implemented method of clause 20, wherein the proteome has the ten million positions, wherein there are nineteen combinatorically generated amino acid substitutions for each of the ten million positions, and therefore the pathogenic training set has one hundred and ninety million pathogenic protein samples.
-   22. The computer-implemented method of clause 21, wherein the respective pathogenic protein samples have respective gapped spatial representations generated by using the respective reference amino acids as respective gap amino acids.
-   23. The computer-implemented method of clause 22, wherein the respective pathogenic protein samples have respective representations of the respective combinatorically generated amino acid substitutions as respective alternate amino acids created by respective combinatorically generated nucleotide variants at the respective positions in the proteome.
-   24. The computer-implemented method of clause 23, wherein the pathogenicity predictor trains on a particular pathogenic protein sample and estimates a pathogenicity of a particular combinatorically generated amino acid substitution for a particular reference amino acid at a particular position in the particular pathogenic protein sample by
-   processing, as input,
    -   (i) a particular gapped spatial representation of the particular pathogenic protein sample,
        -   wherein the particular gapped spatial representation is generated
            -   by using the particular reference amino acid as a gap amino acid, and
            -   by using remaining amino acids at remaining positions in the particular pathogenic protein sample as non-gap amino acids, and
    -   (ii) a representation of the particular combinatorically generated amino acid substitution as a particular alternate amino acid; and
-   generating, as output, a pathogenicity score for the particular combinatorically generated amino acid substitution.
-   25. The computer-implemented method of clause 24, wherein each of the pathogenic protein samples has a ground truth pathogenicity label that indicates absolute pathogenicity of the pathogenic protein samples.
-   26. The computer-implemented method of clause 25, wherein the ground truth pathogenicity label is one.
-   27. The computer-implemented method of clause 26, wherein the pathogenicity score for the particular combinatorically generated amino acid substitution is compared against the ground truth pathogenicity label to determine an error, and to improve the coefficients of the pathogenicity predictor based on the error using the training technique.
-   28. The computer-implemented method of clause 27, wherein the pathogenicity predictor is trained on two hundred million training iterations,
-   wherein the two hundred million training iterations include
    -   ten million training iterations with the ten million benign protein samples, and
    -   one hundred and ninety million iterations with the one hundred and ninety million pathogenic protein samples.
-   29. The computer-implemented method of clause 10, wherein the proteome has one million to ten million positions, and therefore the benign training set has one million to ten million benign protein samples,
-   wherein there are nineteen combinatorically generated amino acid substitutions for each of the one million to ten million positions, and therefore the pathogenic training set has nineteen million to one hundred and ninety million pathogenic protein samples.
-   30. The computer-implemented method of clause 29, wherein the pathogenicity predictor is trained on twenty million to two hundred million training iterations,
-   wherein the twenty million to two hundred million training iterations include
    -   one million to ten million training iterations with the one million to ten million benign protein samples, and
    -   nineteen million to one hundred and ninety million iterations with the nineteen million to one hundred and ninety million pathogenic protein samples.
-   31. The computer-implemented method of clause 6, wherein the alternate amino acid is an amino acid that is same as the reference amino acid.
-   32. The computer-implemented method of clause 31, wherein the alternate amino acid is an amino acid that is different from the reference amino acid.
-   33. The computer-implemented method of clause 32, wherein the pathogenicity predictor generates a first pathogenicity score for a first alternate amino acid that is same as a first reference amino acid,
-   wherein the pathogenicity predictor generates a second pathogenicity score for a second alternate amino acid that is different from the first reference amino acid.
-   34. The computer-implemented method of clause 33, wherein a final pathogenicity score for the second alternate amino acid is the second pathogenicity score.
-   35. The computer-implemented method of clause 34, wherein the final pathogenicity score for the second alternate amino acid is based on a combination of the first pathogenicity score and the second pathogenicity score.
-   36. The computer-implemented method of clause 35, wherein the final pathogenicity score for the second alternate amino acid is a ratio of the second pathogenicity score over a sum of the first pathogenicity score and the second pathogenicity score.
-   37. The computer-implemented method of clause 36, wherein the final pathogenicity score for the second alternate amino acid is determined by subtracting the first pathogenicity score from the second pathogenicity score.
-   38. The computer-implemented method of clause 1, wherein the spatial configurations of the non-gap amino acids are encoded as evolutionary profile channels based on pan-amino acid conservation frequencies of amino acids with nearest atoms to the voxels.
-   39. The computer-implemented method of clause 38, wherein the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding nearest atoms of the gap amino acid when determining the pan-amino acid conservation frequencies.
-   40. The computer-implemented method of clause 1, wherein the spatial configurations of the non-gap amino acids are encoded as evolutionary profile channels based on per-amino acid conservation frequencies of respective amino acids with respective nearest atoms to the voxels.
-   41. The computer-implemented method of clause 40, wherein the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding respective nearest atoms of the gap amino acid when determining the per-amino acid conservation frequencies.
-   42. The computer-implemented method of clause 1, wherein the spatial configurations of the non-gap amino acids are encoded as annotation channels.
-   43. The computer-implemented method of clause 42, wherein the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding atoms of the gap amino acid when determining the annotation channels.
-   44. The computer-implemented method of clause 1, wherein the spatial configurations of the non-gap amino acids are encoded as structural confidence channels.
-   45. The computer-implemented method of clause 44, wherein the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding atoms of the gap amino acid when determining the structural confidence channels.
-   46. The computer-implemented method of clause 1, wherein the spatial configurations of the non-gap amino acids are encoded as additional input channels.
-   47. The computer-implemented method of clause 46, wherein the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding atoms of the gap amino acid when determining the additional input channels.
-   48. 
The computer-implemented method of clause 9, wherein the proteome includes human proteome and non-human proteome, including non-human primate proteome.
-   49. The computer-implemented method of clause 7, wherein those unreachable alternate amino acid classes that are confined by reachability of single nucleotide polymorphisms (SNPs) to transform a reference codon of a reference amino acid into alternate amino acids of the unreachable alternate amino acid classes are masked in ground truth labels.
-   50. The computer-implemented method of clause 1, wherein masked amino acid classes result in zero loss and do not contribute to gradient updates.
-   51. The computer-implemented method of clause 50, wherein the masked amino acid classes are identified in a lookup table.
-   52. The computer-implemented method of clause 51, wherein the lookup table identifies a set of masked amino acid classes for each reference amino acid position.
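The masking of unreachable amino acid classes described in clauses 49 to 52 can be illustrated with a minimal sketch. It assumes the standard genetic code and keys the lookup table by reference codon; the clauses only state that a lookup table identifies the masked classes for each reference amino acid position, so the names `reachable_classes`, `build_mask_lookup`, and `MASK_LOOKUP` and the codon-keyed layout are hypothetical choices made for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; "*" marks stop codons.
AA_BY_CODON = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AA_BY_CODON)
}
AMINO_ACIDS = sorted(set(AA_BY_CODON) - {"*"})  # the twenty amino acid classes


def reachable_classes(ref_codon):
    """Amino acid classes reachable from ref_codon by one nucleotide change."""
    reachable = set()
    for i, base in enumerate(ref_codon):
        for alt in BASES:
            if alt == base:
                continue
            aa = CODON_TABLE[ref_codon[:i] + alt + ref_codon[i + 1:]]
            if aa != "*":  # stop codons are not amino acid classes
                reachable.add(aa)
    return reachable


def build_mask_lookup():
    """Map each coding reference codon to its set of masked (unreachable) classes.

    Assumption: the reference class itself is never masked.
    """
    lookup = {}
    for codon, ref_aa in CODON_TABLE.items():
        if ref_aa == "*":
            continue
        keep = reachable_classes(codon) | {ref_aa}
        lookup[codon] = set(AMINO_ACIDS) - keep
    return lookup


MASK_LOOKUP = build_mask_lookup()
```

In a training loop, the masked classes for a position would be skipped when accumulating the loss, so that they contribute zero loss and no gradient.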

Clauses Set 2

-   1. A computer-implemented method of determining pathogenicity of    nucleotide variants, including:-   accessing a protein that has respective amino acids at respective    positions;-   specifying a particular amino acid of a particular amino acid class    at a particular position in the protein as a gap amino acid, and    specifying remaining amino acids at remaining positions in the    protein as non-gap amino acids;-   generating a gapped spatial representation of the protein that    -   includes spatial configurations of the non-gap amino acids, and    -   excludes a spatial configuration of the gap amino acid; and-   based at least in part on the gapped spatial representation,    determining a pathogenicity of respective alternate amino acids at    the particular position.-   2. The computer-implemented method of clause 1, wherein the spatial    configurations of the non-gap amino acids are encoded as amino acid    class-wise distance channels,-   wherein each of the amino acid class-wise distance channels has    voxel-wise distance values for voxels in a plurality of voxels, and-   wherein the voxel-wise distance values specify distances from    corresponding voxels in the plurality of voxels to atoms of the    non-gap amino acids.-   3. The computer-implemented method of clause 2, wherein the spatial    configurations of the non-gap amino acids are determined based on    spatial proximity between the corresponding voxels and the atoms of    the non-gap amino acids.-   4. The computer-implemented method of clause 2, wherein the spatial    configuration of the gap amino acid is excluded from the gapped    spatial representation by disregarding distances from the    corresponding voxels to atoms of the gap amino acid when determining    the voxel-wise distance values.-   5. 
The computer-implemented method of clause 4, wherein the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding spatial proximity between the corresponding voxels and the atoms of the gap amino acid.
-   6. The computer-implemented method of clause 1, wherein the particular amino acid is a reference amino acid that is a major allele of the protein.
-   7. The computer-implemented method of clause 1, wherein the respective alternate amino acids are respective combinatorically generated alternate amino acids created by respective combinatorically generated nucleotide variants at the particular position.
-   8. The computer-implemented method of clause 1, wherein a pathogenicity predictor determines the pathogenicity of the respective alternate amino acids by
-   processing, as input, the gapped spatial representation; and
-   generating, as output, respective pathogenicity scores for respective amino acid classes.
-   9. The computer-implemented method of clause 8, wherein the pathogenicity predictor is trained on a training set.
-   10. The computer-implemented method of clause 9, wherein the training set has respective protein samples for respective positions in a proteome.
-   11. The computer-implemented method of clause 10, wherein the proteome has ten million positions, and therefore the training set has ten million protein samples.
-   12. The computer-implemented method of clause 11, wherein the respective protein samples have respective gapped spatial representations generated by using respective reference amino acids at the respective positions in the proteome as respective gap amino acids.
-   13. The computer-implemented method of clause 12, wherein the reference amino acids are major allele amino acids of the proteome.
-   14. 
The computer-implemented method of clause 13, wherein the    pathogenicity predictor trains on a particular protein sample and    estimates a pathogenicity of respective alternate amino acids for a    particular reference amino acid at a particular position in the    particular protein sample by-   processing, as input,    -   a particular gapped spatial representation of the particular        protein sample,        -   wherein the particular gapped spatial representation is            generated            -   by using the particular reference amino acid as a gap                amino acid, and            -   by using remaining amino acids at remaining positions in                the particular protein sample as non-gap amino acids;                and-   generating, as output, respective pathogenicity scores for the    respective amino acid classes.-   15. The computer-implemented method of clause 14, wherein each of    the protein samples has respective ground truth labels for the    respective amino acid classes.-   16. The computer-implemented method of clause 15, wherein the    respective ground truth labels include an absolute benignness label    for a reference amino acid class in the respective amino acid    classes, and include respective absolute pathogenicity labels for    respective alternate amino acid classes in the respective amino acid    classes.-   17. The computer-implemented method of clause 16, wherein the    absolute benignness label is zero.-   18. The computer-implemented method of clause 17, wherein the    absolute pathogenicity labels are same across the respective    alternate amino acid classes.-   19. The computer-implemented method of clause 18, wherein the    absolute pathogenicity labels are one.-   20. 
The computer-implemented method of clause 1, wherein an error is    determined based on-   a comparison of a pathogenicity score for the reference amino acid    class against the absolute benignness label, and-   respective comparisons of respective pathogenicity scores for the    respective alternate amino acid classes against the respective    absolute pathogenicity labels.-   21. The computer-implemented method of clause 20, wherein    coefficients of the pathogenicity predictor are improved based on    the error using a training technique.-   22. The computer-implemented method of clause 21, wherein the    pathogenicity predictor is trained on ten million training    iterations with the ten million protein samples.-   23. The computer-implemented method of clause 8, wherein the    respective amino acid classes correspond to respective twenty    naturally-occurring amino acids.-   24. The computer-implemented method of clause 23, wherein the    respective amino acid classes correspond to respective    naturally-occurring amino acids from a subset of the twenty    naturally-occurring amino acids.-   25. The computer-implemented method of clause 11, wherein the    proteome has one million to ten million positions, and therefore the    training set has one million to ten million protein samples,-   wherein the pathogenicity predictor is trained on one million to ten    million training iterations with the one million to ten million    protein samples.-   26. The computer-implemented method of clause 8, wherein the    pathogenicity predictor generates a reference pathogenicity score    for a first alternate amino acid of the reference amino acid class,-   wherein the pathogenicity predictor generates respective alternate    pathogenicity scores for respective alternate amino acids of the    respective alternate amino acid classes.-   27. 
The computer-implemented method of clause 26, wherein respective    final alternate pathogenicity scores for the respective alternate    amino acids are the respective alternate pathogenicity scores.-   28. The computer-implemented method of clause 27, wherein the    respective final alternate pathogenicity scores for the respective    alternate amino acids are based on respective combinations of the    reference pathogenicity score and the respective alternate    pathogenicity scores.-   29. The computer-implemented method of clause 28, wherein the    respective final alternate pathogenicity scores for the respective    alternate amino acids are respective ratios of the respective    alternate pathogenicity scores over a sum of the reference    pathogenicity score and the respective alternate pathogenicity    scores.-   30. The computer-implemented method of clause 29, wherein the    respective final alternate pathogenicity scores for the respective    alternate amino acids are determined by respectively subtracting the    reference pathogenicity score from the respective alternate    pathogenicity scores.-   31. The computer-implemented method of clause 8, wherein the    pathogenicity predictor has an output layer that generates the    respective pathogenicity scores.-   32. The computer-implemented method of clause 31, wherein the output    layer is a normalization layer.-   33. The computer-implemented method of clause 32, wherein the    respective pathogenicity scores are normalized.-   34. The computer-implemented method of clause 31, wherein the output    layer is a softmax layer.-   35. The computer-implemented method of clause 34, wherein the    respective pathogenicity scores are exponentially normalized.-   36. The computer-implemented method of clause 31, wherein the output    layer has respective sigmoid units that respectively generate the    respective pathogenicity scores.-   37. 
The computer-implemented method of clause 31, wherein the    respective pathogenicity scores are unnormalized.-   38. The computer-implemented method of clause 1, wherein the spatial    configurations of the non-gap amino acids are encoded as    evolutionary profile channels based on pan-amino acid conservation    frequencies of amino acids with nearest atoms to the voxels.-   39. The computer-implemented method of clause 38, wherein the    spatial configuration of the gap amino acid is excluded from the    gapped spatial representation by disregarding nearest atoms of the    gap amino acid when determining the pan-amino acid conservation    frequencies.-   40. The computer-implemented method of clause 1, wherein the spatial    configurations of the non-gap amino acids are encoded as    evolutionary profile channels based on per-amino acid conservation    frequencies of respective amino acids with respective nearest atoms    to the voxels.-   41. The computer-implemented method of clause 40, wherein the    spatial configuration of the gap amino acid is excluded from the    gapped spatial representation by disregarding respective nearest    atoms of the gap amino acid when determining the per-amino acid    conservation frequencies.-   42. The computer-implemented method of clause 1, wherein the spatial    configurations of the non-gap amino acids are encoded as annotation    channels.-   43. The computer-implemented method of clause 42, wherein the    spatial configuration of the gap amino acid is excluded from the    gapped spatial representation by disregarding atoms of the gap amino    acid when determining the annotation channels.-   44. The computer-implemented method of clause 1, wherein the spatial    configurations of the non-gap amino acids are encoded as structural    confidence channels.-   45. 
The computer-implemented method of clause 44, wherein the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding atoms of the gap amino acid when determining the structural confidence channels.
-   46. The computer-implemented method of clause 1, wherein the spatial configurations of the non-gap amino acids are encoded as additional input channels.
-   47. The computer-implemented method of clause 46, wherein the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding atoms of the gap amino acid when determining the additional input channels.
-   48. The computer-implemented method of clause 10, wherein the proteome includes human proteome and non-human proteome, including non-human primate proteome.
-   49. The computer-implemented method of clause 8, wherein those unreachable alternate amino acid classes that are confined by reachability of single nucleotide polymorphisms (SNPs) to transform a reference codon of a reference amino acid into alternate amino acids of the unreachable alternate amino acid classes are masked in ground truth labels.
-   50. The computer-implemented method of clause 1, wherein masked amino acid classes result in zero loss and do not contribute to gradient updates.
-   51. The computer-implemented method of clause 50, wherein the masked amino acid classes are identified in a lookup table.
-   52. The computer-implemented method of clause 51, wherein the lookup table identifies a set of masked amino acid classes for each reference amino acid position.
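The per-class labels, masked error, and final score combinations recited in this clause set (clauses 16 to 20, 27 to 30, and 50) can be sketched as follows. This is an illustration under stated assumptions rather than the disclosed implementation: the clauses do not name an error function, so binary cross-entropy is assumed here, and the names `ground_truth_labels`, `masked_error`, and `final_score` are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the twenty amino acid classes


def ground_truth_labels(ref_aa):
    """Label the reference class 0 (benign) and every alternate class 1 (pathogenic)."""
    return {aa: 0.0 if aa == ref_aa else 1.0 for aa in AMINO_ACIDS}


def masked_error(scores, labels, masked=frozenset(), eps=1e-7):
    """Mean binary cross-entropy over unmasked classes (assumed error function).

    Masked classes are skipped, so they add zero loss and no gradient.
    """
    total, counted = 0.0, 0
    for aa, label in labels.items():
        if aa in masked:
            continue
        p = min(max(scores[aa], eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= label * math.log(p) + (1.0 - label) * math.log(1.0 - p)
        counted += 1
    return total / counted if counted else 0.0


def final_score(ref_score, alt_score, mode="raw"):
    """Combine the reference and alternate scores into a final alternate score."""
    if mode == "raw":          # the alternate score itself
        return alt_score
    if mode == "ratio":        # alternate / (reference + alternate)
        denom = ref_score + alt_score
        return alt_score / denom if denom else 0.0
    if mode == "difference":   # alternate - reference
        return alt_score - ref_score
    raise ValueError(f"unknown mode: {mode}")
```

For example, with a reference score of 0.2 and an alternate score of 0.6, the ratio form gives approximately 0.75 and the difference form gives approximately 0.4.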

Clauses Set 3

-   1. A computer-implemented method of generating training data for training a variant pathogenicity classifier, including:
-   accessing a multitude of amino acid positions in a proteome with a plurality of proteins;
-   specifying major allele amino acids at the multitude of amino acid positions as reference amino acids of the plurality of proteins;
-   for each amino acid position in the multitude of amino acid positions,
    -   classifying those nucleotide substitutions as benign variants that substitute a particular reference amino acid with the particular reference amino acid at a particular amino acid position in a particular alternate representation of a particular protein, and
    -   classifying those nucleotide substitutions as pathogenic variants that substitute the particular reference amino acid with alternate amino acids at the particular amino acid position, wherein the alternate amino acids are different from the particular reference amino acid; and
-   training a variant pathogenicity classifier using the benign variants and the pathogenic variants as training data.
-   2. The computer-implemented method of clause 1, wherein the variant pathogenicity classifier is trained to determine whether a substitution of a first amino acid with a second amino acid at a given amino acid position in a protein is pathogenic or benign.
-   3. The computer-implemented method of clause 2, wherein the variant pathogenicity classifier is trained to generate a pathogenicity score for the substitution.
-   4. The computer-implemented method of clause 1, wherein the variant pathogenicity classifier is trained to determine whether respective substitutions of a first amino acid with respective amino acids at a given amino acid position in a protein are pathogenic or benign.
-   5. 
The computer-implemented method of clause 4, wherein the variant    pathogenicity classifier is trained to generate respective    pathogenicity scores for the respective substitutions.-   6. The computer-implemented method of clause 5, wherein the    respective amino acids correspond to respective twenty    naturally-occurring amino acids.-   7. The computer-implemented method of clause 6, wherein the    respective amino acids correspond to respective naturally-occurring    amino acids from a subset of the twenty naturally-occurring amino    acids.-   8. The computer-implemented method of clause 1, wherein the variant    pathogenicity classifier is trained to determine whether an    insertion of an amino acid at a given vacant amino acid position in    a protein is pathogenic or benign.-   9. The computer-implemented method of clause 8, wherein the variant    pathogenicity classifier is trained to generate a pathogenicity    score for the insertion.-   10. The computer-implemented method of clause 1, wherein the variant    pathogenicity classifier is trained to determine whether respective    insertions of respective amino acids at a given vacant amino acid    position in a protein are pathogenic or benign.-   11. The computer-implemented method of clause 10, wherein the    variant pathogenicity classifier is trained to generate respective    pathogenicity scores for the respective insertions.-   12. The computer-implemented method of clause 11, wherein the    respective amino acids correspond to respective twenty    naturally-occurring amino acids.-   13. The computer-implemented method of clause 12, wherein the    respective amino acids correspond to respective naturally-occurring    amino acids from a subset of the twenty naturally-occurring amino    acids.-   14. 
The computer-implemented method of clause 1, wherein the variant    pathogenicity classifier is trained to determine whether a    substitution of a first amino acid with a second amino acid at a    given amino acid position in a protein is spatially tolerated by    other amino acids of the protein or not.-   15. The computer-implemented method of clause 14, wherein the    variant pathogenicity classifier is trained to generate a spatial    tolerance score for the substitution.-   16. The computer-implemented method of clause 1, wherein the variant    pathogenicity classifier is trained to determine whether respective    substitutions of a first amino acid with respective amino acids at a    given amino acid position in a protein are spatially tolerated by    other amino acids of the protein or not.-   17. The computer-implemented method of clause 16, wherein the    variant pathogenicity classifier is trained to generate respective    spatial tolerance scores for the respective substitutions.-   18. The computer-implemented method of clause 17, wherein the    respective amino acids correspond to respective twenty    naturally-occurring amino acids.-   19. The computer-implemented method of clause 18, wherein the    respective amino acids correspond to respective naturally-occurring    amino acids from a subset of the twenty naturally-occurring amino    acids.-   20. The computer-implemented method of clause 1, wherein the variant    pathogenicity classifier is trained to determine whether an    insertion of an amino acid at a given vacant amino acid position in    a protein is spatially tolerated by other amino acids of the protein    or not.-   21. The computer-implemented method of clause 20, wherein the    variant pathogenicity classifier is trained to generate a spatial    tolerance score for the insertion.-   22. 
The computer-implemented method of clause 1, wherein the variant pathogenicity classifier is trained to determine whether respective insertions of respective amino acids at a given vacant amino acid position in a protein are spatially tolerated by other amino acids of the protein or not.
-   23. The computer-implemented method of clause 22, wherein the variant pathogenicity classifier is trained to generate respective spatial tolerance scores for the respective insertions.
-   24. The computer-implemented method of clause 23, wherein the respective amino acids correspond to respective twenty naturally-occurring amino acids.
-   25. The computer-implemented method of clause 24, wherein the respective amino acids correspond to respective naturally-occurring amino acids from a subset of the twenty naturally-occurring amino acids.
-   26. The computer-implemented method of clause 1, wherein the variant pathogenicity classifier is trained to determine whether a substitution of a first amino acid with a second amino acid at a given amino acid position in a protein is evolutionarily conserved or non-conserved.
-   27. The computer-implemented method of clause 26, wherein the variant pathogenicity classifier is trained to generate an evolutionary conservation score for the substitution.
-   28. The computer-implemented method of clause 1, wherein the variant pathogenicity classifier is trained to determine whether respective substitutions of a first amino acid with respective amino acids at a given amino acid position in a protein are evolutionarily conserved or non-conserved.
-   29. The computer-implemented method of clause 28, wherein the variant pathogenicity classifier is trained to generate respective evolutionary conservation scores for the respective substitutions.
-   30. 
The computer-implemented method of clause 29, wherein the respective amino acids correspond to respective twenty naturally-occurring amino acids.
-   31. The computer-implemented method of clause 30, wherein the respective amino acids correspond to respective naturally-occurring amino acids from a subset of the twenty naturally-occurring amino acids.
-   32. The computer-implemented method of clause 1, wherein the variant pathogenicity classifier is trained to determine whether an insertion of an amino acid at a given vacant amino acid position in a protein is evolutionarily conserved or non-conserved.
-   33. The computer-implemented method of clause 32, wherein the variant pathogenicity classifier is trained to generate an evolutionary conservation score for the insertion.
-   34. The computer-implemented method of clause 1, wherein the variant pathogenicity classifier is trained to determine whether respective insertions of respective amino acids at a given vacant amino acid position in a protein are evolutionarily conserved or non-conserved.
-   35. The computer-implemented method of clause 34, wherein the variant pathogenicity classifier is trained to generate respective evolutionary conservation scores for the respective insertions.
-   36. The computer-implemented method of clause 35, wherein the respective amino acids correspond to respective twenty naturally-occurring amino acids.
-   37. The computer-implemented method of clause 36, wherein the respective amino acids correspond to respective naturally-occurring amino acids from a subset of the twenty naturally-occurring amino acids.
-   38. The computer-implemented method of clause 14, wherein spatial tolerance corresponds to structural tolerance, and spatial intolerance corresponds to structural intolerance.
-   39. 
The computer-implemented method of clause 1, wherein the multitude of amino acid positions range from one million to ten million amino acid positions.
-   40. The computer-implemented method of clause 1, wherein the multitude of amino acid positions range from ten million to one hundred million amino acid positions.
-   41. The computer-implemented method of clause 1, wherein the multitude of amino acid positions range from one hundred million to a billion amino acid positions.
-   42. The computer-implemented method of clause 1, wherein the multitude of amino acid positions range from one to a million amino acid positions.
-   43. The computer-implemented method of clause 1, wherein those unreachable alternate amino acid classes that are confined by reachability of single nucleotide polymorphisms (SNPs) to transform a reference codon of a reference amino acid into alternate amino acids of the unreachable alternate amino acid classes are masked in ground truth labels.
-   44. The computer-implemented method of clause 1, wherein masked amino acid classes result in zero loss and do not contribute to gradient updates.
-   45. The computer-implemented method of clause 44, wherein the masked amino acid classes are identified in a lookup table.
-   46. The computer-implemented method of clause 45, wherein the lookup table identifies a set of masked amino acid classes for each reference amino acid position.
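The training data generation of clause 1 of this set can be sketched as follows. For each position, the reference-to-reference substitution is treated as a benign sample and each of the nineteen reference-to-alternate substitutions as a pathogenic sample. The function name `generate_training_samples`, the tuple layout, and the use of 0 and 1 as labels are assumptions for illustration; the 0 and 1 values follow the label conventions stated in the earlier clause sets, and a short amino acid string stands in for a proteome of major allele amino acids.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the twenty amino acid classes


def generate_training_samples(reference_sequence):
    """Return (benign, pathogenic) sample lists for a reference sequence.

    Each sample is a tuple (position, reference_aa, alternate_aa, label),
    where label 0 marks a benign sample and label 1 a pathogenic sample.
    """
    benign, pathogenic = [], []
    for position, ref_aa in enumerate(reference_sequence):
        # Substituting the reference with itself is classified as benign.
        benign.append((position, ref_aa, ref_aa, 0))
        # Every other amino acid class is classified as pathogenic.
        for alt_aa in AMINO_ACIDS:
            if alt_aa != ref_aa:
                pathogenic.append((position, ref_aa, alt_aa, 1))
    return benign, pathogenic


benign, pathogenic = generate_training_samples("MKV")
```

A sequence of N positions yields N benign samples and 19 × N pathogenic samples, which is consistent with the ten million benign and one hundred and ninety million pathogenic samples recited for a proteome of ten million positions.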

Clauses Set 4

-   1. A computer-implemented method of determining pathogenicity of    nucleotide variants, including:-   specifying a particular amino acid at a particular position in a    protein as a gap amino acid, and specifying remaining amino acids at    remaining positions in the protein as non-gap amino acids;-   generating a gapped spatial representation of the protein that    -   includes spatial configurations of the non-gap amino acids, and    -   excludes a spatial configuration of the gap amino acid;-   determining an evolutionary conservation at the particular position    of respective amino acids of respective amino acid classes based at    least in part on the gapped spatial representation; and-   based at least in part on the evolutionary conservation of the    respective amino acids, determining a pathogenicity of respective    nucleotide variants that respectively substitute the particular    amino acid with the respective amino acids in alternate    representations of the protein.-   2. The computer-implemented method of clause 1, wherein the spatial    configurations of the non-gap amino acids are encoded as amino acid    class-wise distance channels,-   wherein each of the amino acid class-wise distance channels has    voxel-wise distance values for voxels in a plurality of voxels, and-   wherein the voxel-wise distance values specify distances from    corresponding voxels in the plurality of voxels to atoms of the    non-gap amino acids.-   3. The computer-implemented method of clause 2, wherein the spatial    configurations of the non-gap amino acids are determined based on    spatial proximity between the corresponding voxels and the atoms of    the non-gap amino acids.-   4. 
The computer-implemented method of clause 2, wherein the spatial    configuration of the gap amino acid is excluded from the gapped    spatial representation by disregarding distances from the    corresponding voxels to atoms of the gap amino acid when determining    the voxel-wise distance values.-   5. The computer-implemented method of clause 4, wherein the spatial    configuration of the gap amino acid is excluded from the gapped    spatial representation by disregarding spatial proximity between the    corresponding voxels and the atoms of the gap amino acid.-   6. The computer-implemented method of clause 1, wherein the    particular amino acid is a reference amino acid that is a major    allele of the protein.-   7. The computer-implemented method of clause 1, wherein an    evolutionary conservation predictor determines the evolutionary    conservation by-   processing, as input, the gapped spatial representation; and-   generating, as output, respective evolutionary conservation scores    for the respective amino acids.-   8. The computer-implemented method of clause 7, wherein the    respective evolutionary conservation scores are rankable by    magnitude.-   9. The computer-implemented method of clause 7, further including    classifying a nucleotide variant as pathogenic when an evolutionary    conservation score generated by the evolutionary conservation    predictor for a corresponding amino acid substitution is below a    threshold.-   10. The computer-implemented method of clause 7, further including    classifying a nucleotide variant as pathogenic when an evolutionary    conservation score generated by the evolutionary conservation    predictor for a corresponding amino acid substitution is zero.-   11. 
The computer-implemented method of clause 7, further including    classifying a nucleotide variant as benign when an evolutionary    conservation score generated by the evolutionary conservation    predictor for a corresponding amino acid substitution is above a    threshold.-   12. The computer-implemented method of clause 7, further including    classifying a nucleotide variant as benign when an evolutionary    conservation score generated by the evolutionary conservation    predictor for a corresponding amino acid substitution is non-zero.-   13. The computer-implemented method of clause 7, wherein the    evolutionary conservation predictor is trained on a conserved    training set and a non-conserved training set.-   14. The computer-implemented method of clause 13, wherein the    conserved training set has respective conserved protein samples for    respective conserved amino acids at respective positions in a    proteome,-   wherein the non-conserved training set has respective non-conserved    protein samples for respective non-conserved amino acids at the    respective positions.-   15. The computer-implemented method of clause 14, wherein each of    the respective positions has a set of conserved amino acids and a    set of non-conserved amino acids.-   16. The computer-implemented method of clause 15, wherein a    particular set of conserved amino acids for a particular position in    a particular protein in the proteome includes at least one major    allele amino acid observed at the particular position across a    plurality of species.-   17. The computer-implemented method of clause 16, wherein the    particular set of conserved amino acids includes one or more minor    allele amino acids observed at the particular position across the    plurality of species.-   18. 
The computer-implemented method of clause 17, wherein a    particular set of non-conserved amino acids for the particular    position includes amino acids not in the particular set of conserved    amino acids.-   19. The computer-implemented method of clause 18, wherein the    particular set of conserved amino acids and the particular set of    non-conserved amino acids are identified based on evolutionary    conservation profiles of homologous proteins of the plurality of    species.-   20. The computer-implemented method of clause 18, wherein the    evolutionary conservation profiles of the homologous proteins are    determined using a position-specific frequency matrix (PSFM).-   21. The computer-implemented method of clause 18, wherein the    evolutionary conservation profiles of the homologous proteins are    determined using a position-specific scoring matrix (PSSM).-   22. The computer-implemented method of clause 16, wherein the major    allele amino acid is a reference amino acid.-   23. The computer-implemented method of clause 14, wherein each of    the respective positions has C conserved amino acids in the set of    conserved amino acids,-   wherein each of the respective positions has NC non-conserved amino    acids in the set of non-conserved amino acids, where NC=20−C,-   wherein the conserved training set has CP conserved protein samples,    where CP=a number of the respective positions*C, and-   wherein the non-conserved training set has NCP non-conserved protein    samples, where NCP=the number of the respective positions*(20−C).-   24. The computer-implemented method of clause 23, wherein the C    ranges from one to ten.-   25. The computer-implemented method of clause 24, wherein the C    varies across the respective positions.-   26. The computer-implemented method of clause 25, wherein the C is    same for some of the respective positions.-   27. 
The computer-implemented method of clause 14, wherein the    respective conserved and non-conserved protein samples have    respective gapped spatial representations generated by using    respective reference amino acids at the respective positions as    respective gap amino acids.-   28. The computer-implemented method of clause 27, wherein the    evolutionary conservation predictor trains on a particular conserved    protein sample and estimates an evolutionary conservation of a    particular conserved amino acid at a particular position in the    particular conserved protein sample by-   processing, as input,    -   a particular gapped spatial representation of the particular        conserved protein sample,        -   wherein the particular gapped spatial representation is            generated            -   by using a particular reference amino acid at the                particular position as a gap amino acid, and            -   by using remaining amino acids at remaining positions in                the particular conserved protein sample as non-gap amino                acids; and-   generating, as output, an evolutionary conservation score for the    particular conserved amino acid.-   29. The computer-implemented method of clause 28, wherein each of    the conserved protein samples has a ground truth conserved label.-   30. The computer-implemented method of clause 29, wherein the ground    truth conserved label is an evolutionary conservation frequency.-   31. The computer-implemented method of clause 29, wherein the ground    truth conserved label is one.-   32. The computer-implemented method of clause 29, wherein the    evolutionary conservation for the particular conserved amino acid is    compared against the ground truth conserved label to determine an    error, and to improve coefficients of the evolutionary conservation    predictor based on the error using a training technique.-   33. 
The computer-implemented method of clause 32, wherein the ground    truth conserved label is masked and not used to determine the error    when the particular conserved amino acid is the particular reference    amino acid,-   wherein the masking causes the evolutionary conservation predictor    to not overfit on the particular reference amino acid.-   34. The computer-implemented method of clause 32, wherein the    training technique is a loss function-based gradient update    technique.-   35. The computer-implemented method of clause 27, wherein the    evolutionary conservation predictor trains on a particular    non-conserved protein sample and estimates an evolutionary    conservation of a particular non-conserved amino acid at a    particular position in the particular non-conserved protein sample    by-   processing, as input,    -   a particular gapped spatial representation of the particular        non-conserved protein sample,        -   wherein the particular gapped spatial representation is            generated            -   by using a particular reference amino acid at the                particular position as a gap amino acid, and            -   by using remaining amino acids at remaining positions in                the particular non-conserved protein sample as non-gap                amino acids; and-   generating, as output, an evolutionary conservation score for the    particular non-conserved amino acid.-   36. The computer-implemented method of clause 35, wherein each of    the non-conserved protein samples has a ground truth non-conserved    label.-   37. The computer-implemented method of clause 35, wherein the ground    truth non-conserved label is an evolutionary conservation frequency.-   38. The computer-implemented method of clause 35, wherein the ground    truth non-conserved label is zero.-   39. 
The computer-implemented method of clause 35, wherein the    evolutionary conservation score for the particular non-conserved    amino acid is compared against the ground truth non-conserved label    to determine an error, and to improve the coefficients of the    evolutionary conservation predictor based on the error using the    training technique.-   40. The computer-implemented method of clause 7, wherein the    evolutionary conservation predictor is trained on a training set.-   41. The computer-implemented method of clause 40, wherein the    training set has respective protein samples for the respective    positions in the proteome.-   42. The computer-implemented method of clause 41, wherein the    respective protein samples have respective gapped spatial    representations generated by using the respective reference amino    acids at the respective positions as the respective gap amino acids.-   43. The computer-implemented method of clause 42, wherein the    evolutionary conservation predictor trains on a particular protein    sample and estimates an evolutionary conservation of respective    amino acids of respective amino acid classes at a particular    position in the particular protein sample by-   processing, as input,    -   a particular gapped spatial representation of the particular        protein sample,        -   wherein the particular gapped spatial representation is            generated            -   by using a particular reference amino acid at the                particular position as a gap amino acid, and            -   by using remaining amino acids at remaining positions in                the particular protein sample as non-gap amino acids;                and-   generating, as output, respective evolutionary conservation scores    for the respective amino acids.-   44. The computer-implemented method of clause 43, wherein each of    the protein samples has respective ground truth labels for the    respective amino acids.-   45. 
The computer-implemented method of clause 44, wherein the    respective ground truth labels include one or more conserved labels    for one or more conserved amino acids in the respective amino acids,    and include one or more non-conserved labels for one or more    non-conserved amino acids in the respective amino acids.-   46. The computer-implemented method of clause 45, wherein the    conserved labels and the non-conserved labels have respective    evolutionary conservation frequencies.-   47. The computer-implemented method of clause 46, wherein the    respective evolutionary conservation frequencies are rankable    according to magnitude.-   48. The computer-implemented method of clause 46, wherein the    conserved labels are ones, and the non-conserved labels are zeros.-   49. The computer-implemented method of clause 46, wherein an error    is determined based on-   respective comparisons of respective evolutionary conservation    scores for the respective conserved amino acids against the    respective conserved amino acids, and-   respective comparisons of respective evolutionary conservation    scores for the respective non-conserved amino acids against the    respective non-conserved amino acids.-   50. The computer-implemented method of clause 49, wherein    coefficients of the evolutionary conservation predictor are improved    based on the error using the training technique.-   51. The computer-implemented method of clause 50, wherein the    conserved amino acids include the particular reference amino acid,    and a conserved label for the particular reference amino acid is    masked and not used to determine the error,-   wherein the masking causes the evolutionary conservation predictor    to not overfit on the particular reference amino acid.-   52. 
The computer-implemented method of clause 14, wherein the    proteome has one to ten million positions,-   wherein each of the one to ten million positions has the C conserved    amino acids in the set of conserved amino acids,-   wherein each of the one to ten million positions has the NC    non-conserved amino acids in the set of non-conserved amino acids,    where NC=20−C,-   wherein the conserved training set has the CP conserved protein    samples, where CP=one to ten million*C, and-   wherein the non-conserved training set has the NCP non-conserved    protein samples, where NCP=one to ten million *(20−C).-   53. The computer-implemented method of clause 14, wherein the    evolutionary conservation predictor is trained on twenty million to    two hundred million training iterations,-   wherein the twenty million to two hundred million training    iterations include    -   one million to ten million training iterations with the one        million to ten million conserved protein samples, and    -   nineteen million to one hundred and ninety million iterations        with the nineteen million to one hundred and ninety million        non-conserved protein samples.    -   54. The computer-implemented method of clause 14, wherein the        proteome has one million to ten million positions, and therefore        the training set has one million to ten million protein samples,-   wherein the evolutionary conservation predictor is trained on one    million to ten million training iterations with the one million to    ten million protein samples.-   55. The computer-implemented method of clause 1, wherein the spatial    configurations of the non-gap amino acids are encoded as    evolutionary profile channels based on pan-amino acid conservation    frequencies of amino acids with nearest atoms to the voxels.-   56. 
The computer-implemented method of clause 55, wherein the    spatial configuration of the gap amino acid is excluded from the    gapped spatial representation by disregarding nearest atoms of the    gap amino acid when determining the pan-amino acid conservation    frequencies.-   57. The computer-implemented method of clause 1, wherein the spatial    configurations of the non-gap amino acids are encoded as    evolutionary profile channels based on per-amino acid conservation    frequencies of respective amino acids with respective nearest atoms    to the voxels.-   58. The computer-implemented method of clause 57, wherein the    spatial configuration of the gap amino acid is excluded from the    gapped spatial representation by disregarding respective nearest    atoms of the gap amino acid when determining the per-amino acid    conservation frequencies.-   59. The computer-implemented method of clause 1, wherein the spatial    configurations of the non-gap amino acids are encoded as annotation    channels.-   60. The computer-implemented method of clause 59, wherein the    spatial configuration of the gap amino acid is excluded from the    gapped spatial representation by disregarding atoms of the gap amino    acid when determining the annotation channels.-   61. The computer-implemented method of clause 1, wherein the spatial    configurations of the non-gap amino acids are encoded as structural    confidence channels.-   62. The computer-implemented method of clause 61, wherein the    spatial configuration of the gap amino acid is excluded from the    gapped spatial representation by disregarding atoms of the gap amino    acid when determining the structural confidence channels.-   63. The computer-implemented method of clause 1, wherein the spatial    configurations of the non-gap amino acids are encoded as structural    confidence channels.-   64. 
The computer-implemented method of clause 63, wherein the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding atoms of the gap amino acid when determining the structural confidence channels.
-   65. The computer-implemented method of clause 1, wherein the spatial configurations of the non-gap amino acids are encoded as additional input channels.
-   66. The computer-implemented method of clause 65, wherein the spatial configuration of the gap amino acid is excluded from the gapped spatial representation by disregarding atoms of the gap amino acid when determining the additional input channels.
-   67. The computer-implemented method of clause 14, wherein the proteome includes human proteome and non-human proteome, including non-human primate proteome.
-   68. The computer-implemented method of clause 1, wherein those unreachable alternate amino acid classes that are confined by reachability of single nucleotide polymorphisms (SNPs) to transform a reference codon of a reference amino acid into alternate amino acids of the unreachable alternate amino acid classes are masked in ground truth labels.
-   69. The computer-implemented method of clause 1, wherein masked amino acid classes result in zero loss and do not contribute to gradient updates.
-   70. The computer-implemented method of clause 69, wherein the masked amino acid classes are identified in a lookup table.
-   71. The computer-implemented method of clause 70, wherein the lookup table identifies a set of masked amino acid classes for each reference amino acid position.
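The SNP-reachability masking of clauses 68 to 71 can be illustrated with a minimal sketch. It assumes the standard genetic code and assumes that the lookup table is keyed by reference codon; the names `reachable_amino_acids`, `masked_classes` and `loss_mask` are hypothetical and do not appear in the disclosure. Stop codons are ignored here, and the reference amino acid class is left unmasked, which is also an assumption.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G (assumption:
# the disclosure does not specify which translation table is used).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
AA_CLASSES = sorted(set(AMINO_ACIDS) - {"*"})  # the twenty amino acid classes


def reachable_amino_acids(ref_codon):
    """Amino acids obtainable from ref_codon by one nucleotide substitution."""
    reachable = set()
    for pos in range(3):
        for base in BASES:
            if base == ref_codon[pos]:
                continue
            mutant = ref_codon[:pos] + base + ref_codon[pos + 1:]
            aa = CODON_TABLE[mutant]
            if aa != "*":  # stop codons are not amino acid classes
                reachable.add(aa)
    return reachable


def masked_classes(ref_codon):
    """Hypothetical lookup entry: alternate classes no single SNP can reach."""
    ref_aa = CODON_TABLE[ref_codon]
    return set(AA_CLASSES) - reachable_amino_acids(ref_codon) - {ref_aa}


def loss_mask(ref_codon):
    """Per-class multiplier: 0.0 for masked classes (zero loss), else 1.0."""
    masked = masked_classes(ref_codon)
    return [0.0 if aa in masked else 1.0 for aa in AA_CLASSES]


# Hypothetical lookup table covering every sense codon.
LOOKUP_TABLE = {c: masked_classes(c) for c, aa in CODON_TABLE.items() if aa != "*"}
```

Multiplying the per-class loss by `loss_mask(ref_codon)` zeroes the loss for the masked classes, so those classes contribute nothing to gradient updates. A real pipeline would presumably index the table by reference amino acid position, as clause 71 states, rather than by codon.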

Clauses Set 5

-   1. A computer-implemented method of training a pathogenicity predictor, including:
-   accessing a gapped training set that includes respective gapped protein samples for respective positions in a proteome;
-   accessing a non-gapped training set that includes non-gapped benign protein samples and non-gapped pathogenic protein samples;
-   generating respective gapped spatial representations for the gapped protein samples, and generating respective non-gapped spatial representations for the non-gapped benign protein samples and the non-gapped pathogenic protein samples;
-   training a pathogenicity predictor over one or more training cycles and generating a trained pathogenicity predictor, wherein each of the training cycles uses as training examples gapped spatial representations from the respective gapped spatial representations and non-gapped spatial representations from the respective non-gapped spatial representations; and
-   using the trained pathogenicity classifier to determine pathogenicity of variants.
-   2. The computer-implemented method of clause 1, wherein the respective gapped protein samples are labelled with respective gapped ground truth sequences.
-   3. The computer-implemented method of clause 2, wherein a particular gapped ground truth sequence for a particular gapped protein sample has a benign label for a particular amino acid class that corresponds to a reference amino acid at a particular position in the particular gapped protein.
-   4. The computer-implemented method of clause 3, wherein the particular gapped protein sample has respective pathogenic labels for respective remaining amino acid classes that correspond to alternate amino acids at the particular position.
-   5.
The computer-implemented method of clause 1, wherein a particular non-gapped benign protein sample includes a benign alternate amino acid at a particular position substituted by a benign nucleotide variant.
-   6. The computer-implemented method of clause 5, wherein a particular non-gapped pathogenic protein sample includes a pathogenic alternate amino acid at a particular position substituted by a pathogenic nucleotide variant.
-   7. The computer-implemented method of clause 6, wherein the particular non-gapped benign protein sample is labelled with a benign ground truth sequence that has a benign label for a particular amino acid class that corresponds to the benign alternate amino acid.
-   8. The computer-implemented method of clause 7, wherein the benign ground truth sequence has respective masked labels for respective remaining amino acid classes that correspond to amino acids that are different from the benign alternate amino acid.
-   9. The computer-implemented method of clause 8, wherein the particular non-gapped pathogenic protein sample is labelled with a pathogenic ground truth sequence that has a pathogenic label for a particular amino acid class that corresponds to the pathogenic alternate amino acid.
-   10. The computer-implemented method of clause 9, wherein the pathogenic ground truth sequence has respective masked labels for respective remaining amino acid classes that correspond to amino acids that are different from the pathogenic alternate amino acid.
-   11. The computer-implemented method of clause 1, further including using a sample indicator to indicate to the pathogenicity predictor whether a current training example is a gapped spatial representation for a gapped protein sample, or a non-gapped spatial representation for a non-gapped protein sample.
-   12.
The computer-implemented method of clause 1, further including masking the benign label for the particular amino acid class that corresponds to the reference amino acid at the particular position in the particular gapped protein.
-   13. The computer-implemented method of clause 1, wherein the non-gapped benign protein samples are derived from common human and non-human primate nucleotide variants.
-   14. The computer-implemented method of clause 1, wherein the non-gapped pathogenic protein samples are derived from combinatorically simulated nucleotide variants.
-   15. The computer-implemented method of clause 1, wherein the pathogenicity predictor generates an amino acid class-wise output sequence in response to processing a training example, wherein the amino acid class-wise output sequence has amino acid class-wise pathogenicity scores.
-   16. The computer-implemented method of clause 1, further including measuring performance of the trained pathogenicity predictor between training cycles over a validation set.
-   17. The computer-implemented method of clause 16, wherein the validation set includes a pair of gapped and non-gapped spatial representations for each held-out protein sample.
-   18. The computer-implemented method of clause 1, wherein the trained pathogenicity predictor generates a first amino acid class-wise output sequence for the gapped spatial representation in the pair, and a second amino acid class-wise output sequence for the non-gapped spatial representation in the pair,
-   wherein a final pathogenicity score for a nucleotide variant that causes an amino acid substitution in a held-out protein sample is determined based on a combination of first and second pathogenicity scores for the amino acid substitution in the first and second amino acid class-wise output sequences.
-   19.
The computer-implemented method of clause 18, wherein the final pathogenicity score is based on an average of the first and second pathogenicity scores.
-   20. The computer-implemented method of clause 1, wherein at least some of the training cycles use a same number of gapped spatial representations and non-gapped spatial representations.
-   21. The computer-implemented method of clause 1, wherein at least some of the training cycles use batches of training examples that have a same number of gapped spatial representations and non-gapped spatial representations.
-   22. The computer-implemented method of clause 1, wherein a masked label does not contribute to error determination, and therefore does not contribute to training of the pathogenicity predictor.
-   23. The computer-implemented method of clause 22, wherein the masked label is zeroed-out.
-   24. The computer-implemented method of clause 1, wherein the gapped spatial representations are weighted differently from the non-gapped spatial representations, such that a contribution of the gapped spatial representations to gradient updates applied to parameters of the pathogenicity predictor in response to the pathogenicity predictor processing the gapped spatial representations varies from a contribution of the non-gapped spatial representations to gradient updates applied to the parameters of the pathogenicity predictor in response to the pathogenicity predictor processing the non-gapped spatial representations.
-   25. The computer-implemented method of clause 24, wherein the variation is determined by pre-defined weights.
-   26.
A computer-implemented method of training a pathogenicity predictor, including:
-   starting with training a pathogenicity classifier on a gapped training set and generating a trained pathogenicity classifier;
-   further training the trained pathogenicity classifier on a non-gapped training set and generating a retrained pathogenicity classifier; and
-   using the retrained pathogenicity classifier to determine pathogenicity of variants.
-   27. The computer-implemented method of clause 26, further including measuring performance of the trained pathogenicity predictor between training cycles over a first validation set that includes only non-gapped spatial representations of held-out protein samples.
-   28. The computer-implemented method of clause 27, further including measuring performance of the retrained pathogenicity predictor between training cycles over a second validation set that includes gapped spatial representations and non-gapped spatial representations of held-out protein samples.
-   29. The computer-implemented method of clause 28, wherein the retrained pathogenicity predictor generates a first amino acid class-wise output sequence for the pair in response to processing the pair,
-   wherein a final pathogenicity score for a nucleotide variant that causes an amino acid substitution in a corresponding held-out protein sample is determined based on the first amino acid class-wise output sequence.
-   30.
A computer-implemented method of training a pathogenicity predictor, including:
-   accessing a gapped training set that includes respective gapped protein samples for respective positions in a proteome, wherein the respective gapped protein samples are labelled with respective gapped ground truth sequences, wherein a particular gapped ground truth sequence for a particular gapped protein sample has a benign label for a particular amino acid class that corresponds to a reference amino acid at a particular position in the particular gapped protein, and has respective pathogenic labels for respective remaining amino acid classes that correspond to alternate amino acids at the particular position;
-   accessing a non-gapped training set that includes non-gapped benign protein samples and non-gapped pathogenic protein samples, wherein a particular non-gapped benign protein sample includes a benign alternate amino acid at a particular position substituted by a benign nucleotide variant, wherein a particular non-gapped pathogenic protein sample includes a pathogenic alternate amino acid at a particular position substituted by a pathogenic nucleotide variant, wherein the particular non-gapped benign protein sample is labelled with a benign ground truth sequence that has a benign label for a particular amino acid class that corresponds to the benign alternate amino acid, and respective masked labels for respective remaining amino acid classes that correspond to amino acids that are different from the benign alternate amino acid, and wherein the particular non-gapped pathogenic protein sample is labelled with a pathogenic ground truth sequence that has a pathogenic label for a particular amino acid class that corresponds to the pathogenic alternate amino acid, and respective masked labels for respective remaining amino acid classes that correspond to amino acids that are different from the pathogenic alternate amino acid;
-   generating respective gapped spatial representations for the gapped protein samples, and generating respective non-gapped spatial representations for the non-gapped benign protein samples and the non-gapped pathogenic protein samples;
-   training a pathogenicity predictor over one or more training cycles, and generating a trained pathogenicity predictor, wherein each of the training cycles uses as training examples gapped spatial representations from the respective gapped spatial representations, and non-gapped spatial representations from the respective non-gapped spatial representations; and
-   using the trained pathogenicity classifier to determine pathogenicity of variants.
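The labelling and scoring scheme of Clauses Set 5 can be sketched as follows. The sketch is illustrative only and rests on several assumptions that are not stated in the clauses: benign is encoded as 0.0 and pathogenic as 1.0, masked labels are represented by `None`, the error is a binary cross-entropy, and the helper names (`gapped_labels`, `non_gapped_labels`, `masked_loss`, `final_score`) are hypothetical.

```python
import math

AA_CLASSES = "ACDEFGHIKLMNPQRSTVWY"  # twenty amino acid classes
BENIGN, PATHOGENIC, MASKED = 0.0, 1.0, None  # assumed label encoding


def gapped_labels(ref_aa):
    """Gapped sample: benign for the reference class, pathogenic otherwise."""
    return [BENIGN if aa == ref_aa else PATHOGENIC for aa in AA_CLASSES]


def non_gapped_labels(alt_aa, is_pathogenic):
    """Non-gapped sample: one label for the alternate class, rest masked."""
    label = PATHOGENIC if is_pathogenic else BENIGN
    return [label if aa == alt_aa else MASKED for aa in AA_CLASSES]


def masked_loss(scores, labels, weight=1.0):
    """Binary cross-entropy (an assumed loss) averaged over unmasked classes.

    Masked labels are skipped, so they add nothing to the error. `weight`
    stands in for a pre-defined per-sample-type weight.
    """
    total, count = 0.0, 0
    for score, label in zip(scores, labels):
        if label is MASKED:
            continue
        p = min(max(score, 1e-7), 1.0 - 1e-7)  # keep log() finite
        total -= label * math.log(p) + (1.0 - label) * math.log(1.0 - p)
        count += 1
    return weight * total / count if count else 0.0


def final_score(gapped_scores, non_gapped_scores, alt_aa):
    """Average the two class-wise scores for the substituted amino acid."""
    i = AA_CLASSES.index(alt_aa)
    return 0.5 * (gapped_scores[i] + non_gapped_scores[i])
```

A training cycle in this sketch would evaluate `masked_loss` on a batch that mixes gapped and non-gapped examples, optionally with different `weight` values for the two kinds of example. At inference, `final_score` combines the two class-wise outputs produced for the gapped and non-gapped representations of the same held-out protein sample.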

Clauses Set 6

-   1. A computer-implemented method of generating training data for training a variant pathogenicity classifier, including:
-   accessing a multitude of amino acid positions in a proteome with a plurality of proteins;
-   specifying major allele amino acids at the multitude of amino acid positions as reference amino acids of the plurality of proteins;
-   for each amino acid position in the multitude of amino acid positions,
    -   classifying those nucleotide substitutions as benign variants that substitute a particular reference amino acid with the particular reference amino acid at a particular amino acid position in a particular alternate representation of a particular protein, and
    -   classifying those nucleotide substitutions as pathogenic variants that substitute the particular reference amino acid with alternate amino acids at the particular amino acid position, wherein the alternate amino acids are different from the particular reference amino acid; and
-   training a variant pathogenicity classifier on training data comprising spatial representations of protein samples, such that the spatial representations are assigned ground truth benign labels that correspond to the benign variants, and ground truth pathogenic labels that correspond to the pathogenic variants.
-   2. The computer-implemented method of clause 1, wherein the variant pathogenicity classifier is trained to determine whether a substitution of a first amino acid with a second amino acid at a given amino acid position in a protein is pathogenic or benign.
-   3. The computer-implemented method of clause 2, wherein the variant pathogenicity classifier is trained to generate a pathogenicity score for the substitution.
-   4.
The computer-implemented method of clause 1, wherein the variant pathogenicity classifier is trained to determine whether respective substitutions of a first amino acid with respective amino acids at a given amino acid position in a protein are pathogenic or benign.
-   5. The computer-implemented method of clause 4, wherein the variant pathogenicity classifier is trained to generate respective pathogenicity scores for the respective substitutions.
-   6. The computer-implemented method of clause 5, wherein the respective amino acids correspond to respective twenty naturally-occurring amino acids.
-   7. The computer-implemented method of clause 6, wherein the respective amino acids correspond to respective naturally-occurring amino acids from a subset of the twenty naturally-occurring amino acids.
-   8. The computer-implemented method of clause 1, wherein the variant pathogenicity classifier is trained to determine whether an insertion of an amino acid at a given vacant amino acid position in a protein is pathogenic or benign.
-   9. The computer-implemented method of clause 8, wherein the variant pathogenicity classifier is trained to generate a pathogenicity score for the insertion.
-   10. The computer-implemented method of clause 1, wherein the variant pathogenicity classifier is trained to determine whether respective insertions of respective amino acids at a given vacant amino acid position in a protein are pathogenic or benign.
-   11. The computer-implemented method of clause 10, wherein the variant pathogenicity classifier is trained to generate respective pathogenicity scores for the respective insertions.
-   12. The computer-implemented method of clause 11, wherein the respective amino acids correspond to respective twenty naturally-occurring amino acids.
-   13.
The computer-implemented method of clause 12, wherein the respective amino acids correspond to respective naturally-occurring amino acids from a subset of the twenty naturally-occurring amino acids.
-   14. The computer-implemented method of clause 1, wherein the variant pathogenicity classifier is trained to determine whether a substitution of a first amino acid with a second amino acid at a given amino acid position in a protein is spatially tolerated by other amino acids of the protein or not.
-   15. The computer-implemented method of clause 14, wherein the variant pathogenicity classifier is trained to generate a spatial tolerance score for the substitution.
-   16. The computer-implemented method of clause 1, wherein the variant pathogenicity classifier is trained to determine whether respective substitutions of a first amino acid with respective amino acids at a given amino acid position in a protein are spatially tolerated by other amino acids of the protein or not.
-   17. The computer-implemented method of clause 16, wherein the variant pathogenicity classifier is trained to generate respective spatial tolerance scores for the respective substitutions.
-   18. The computer-implemented method of clause 17, wherein the respective amino acids correspond to respective twenty naturally-occurring amino acids.
-   19. The computer-implemented method of clause 18, wherein the respective amino acids correspond to respective naturally-occurring amino acids from a subset of the twenty naturally-occurring amino acids.
-   20. The computer-implemented method of clause 1, wherein the variant pathogenicity classifier is trained to determine whether an insertion of an amino acid at a given vacant amino acid position in a protein is spatially tolerated by other amino acids of the protein or not.
-   21.
The computer-implemented method of clause 20, wherein the variant pathogenicity classifier is trained to generate a spatial tolerance score for the insertion.
-   22. The computer-implemented method of clause 1, wherein the variant pathogenicity classifier is trained to determine whether respective insertions of respective amino acids at a given vacant amino acid position in a protein are spatially tolerated by other amino acids of the protein or not.
-   23. The computer-implemented method of clause 22, wherein the variant pathogenicity classifier is trained to generate respective spatial tolerance scores for the respective insertions.
-   24. The computer-implemented method of clause 23, wherein the respective amino acids correspond to respective twenty naturally-occurring amino acids.
-   25. The computer-implemented method of clause 24, wherein the respective amino acids correspond to respective naturally-occurring amino acids from a subset of the twenty naturally-occurring amino acids.
-   26. The computer-implemented method of clause 1, wherein the variant pathogenicity classifier is trained to determine whether a substitution of a first amino acid with a second amino acid at a given amino acid position in a protein is evolutionarily conserved or non-conserved.
-   27. The computer-implemented method of clause 26, wherein the variant pathogenicity classifier is trained to generate an evolutionary conservation score for the substitution.
-   28. The computer-implemented method of clause 1, wherein the variant pathogenicity classifier is trained to determine whether respective substitutions of a first amino acid with respective amino acids at a given amino acid position in a protein are evolutionarily conserved or non-conserved.
-   29.
The computer-implemented method of clause 28, wherein the    variant pathogenicity classifier is trained to generate respective    evolutionary conservation scores for the respective substitutions.-   30. The computer-implemented method of clause 29, wherein the    respective amino acids correspond to respective twenty    naturally-occurring amino acids.-   31. The computer-implemented method of clause 30, wherein the    respective amino acids correspond to respective naturally-occurring    amino acids from a subset of the twenty naturally-occurring amino    acids.-   32. The computer-implemented method of clause 1, wherein the variant    pathogenicity classifier is trained to determine whether an    insertion of an amino acid at a given vacant amino acid position in    a protein is evolutionary conserved or non-conserved.-   33. The computer-implemented method of clause 32, wherein the    variant pathogenicity classifier is trained to generate an    evolutionary conservation score for the insertion.-   34. The computer-implemented method of clause 1, wherein the variant    pathogenicity classifier is trained to determine whether respective    insertions of respective amino acids at a given vacant amino acid    position in a protein are evolutionary conserved or non-conserved.-   35. The computer-implemented method of clause 34, wherein the    variant pathogenicity classifier is trained to generate respective    evolutionary conservation scores for the respective insertions.-   36. The computer-implemented method of clause 35, wherein the    respective amino acids correspond to respective twenty    naturally-occurring amino acids.-   37. The computer-implemented method of clause 36, wherein the    respective amino acids correspond to respective naturally-occurring    amino acids from a subset of the twenty naturally-occurring amino    acids.-   38. 
The computer-implemented method of clause 14, wherein spatial    tolerance corresponds to structural tolerance, and spatial    intolerance corresponds to structural intolerance.-   39. The computer-implemented method of clause 1, wherein the    multitude of amino acids positions range from one million to ten    million amino acid positions.-   40. The computer-implemented method of clause 1, wherein the    multitude of amino acids positions range from ten million to hundred    million amino acid positions.-   41. The computer-implemented method of clause 1, wherein the    multitude of amino acids positions range from hundred million to a    billion amino acid positions.-   42. The computer-implemented method of clause 1, wherein the    multitude of amino acids positions range from one to a million amino    acid positions.-   43. The computer-implemented method of clause 1, wherein those    unreachable alternate amino acid classes that are confined by    reachability of single nucleotide polymorphisms (SNPs) to transform    a reference codon of a reference amino acid into alternate amino    acids of the unreachable alternate amino acid classes are masked in    ground truth labels.-   44. The computer-implemented method of clause 1, wherein masked    amino acid classes result in zero loss and do not contribute to    gradient updates.-   45. The computer-implemented method of clause 44, wherein the masked    amino acid classes are identified in a lookup table.-   46. The computer-implemented method of clause 45, wherein the lookup    table identifies a set of masked amino acids classes for each    reference amino acid position.-   47. The computer-implemented method of clause 1, wherein the spatial    representations are structural representations of protein structures    of the protein samples.-   48. The computer-implemented method of clause 1, wherein the spatial    representations are encoded using voxelization
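
By way of illustration only, the masking of SNP-unreachable amino acid classes and the associated lookup table described in the clauses above can be sketched as follows. The sketch assumes the standard genetic code, treats stop codons as outside the twenty amino acid classes, and leaves the reference class itself unmasked; the function names `reachable_classes`, `masked_classes`, and `build_mask_lookup` are hypothetical and do not appear in the disclosure.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons enumerated in TCAG order; '*' marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
ALL_CLASSES = set(AMINO_ACIDS) - {"*"}  # the twenty amino acid classes


def reachable_classes(ref_codon):
    """Alternate amino acid classes reachable from ref_codon by one nucleotide change."""
    ref_aa = CODON_TABLE[ref_codon]
    reachable = set()
    for i, base in enumerate(ref_codon):
        for alt in BASES:
            if alt == base:
                continue
            aa = CODON_TABLE[ref_codon[:i] + alt + ref_codon[i + 1:]]
            # Assumption: stop codons and synonymous changes are not alternate classes.
            if aa != "*" and aa != ref_aa:
                reachable.add(aa)
    return reachable


def masked_classes(ref_codon):
    """Alternate classes that no single nucleotide change can produce (to be masked)."""
    ref_aa = CODON_TABLE[ref_codon]
    return ALL_CLASSES - reachable_classes(ref_codon) - {ref_aa}


def build_mask_lookup(ref_codons):
    """Lookup table: 0-based reference position -> set of masked amino acid classes."""
    return {pos: masked_classes(codon) for pos, codon in enumerate(ref_codons)}
```

For a methionine reference codon `ATG`, for example, a single nucleotide change can reach only six alternate classes, so the remaining thirteen alternate classes would be masked at that position under these assumptions.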

Clauses Set 7

-   1. A computer-implemented method of determining pathogenicity of nucleotide variants, including:
-   accessing a spatial representation of a protein, wherein the spatial representation of the protein specifies respective spatial configurations of respective amino acids at respective positions in the protein;
-   removing, from the spatial representation of the protein, a particular spatial configuration of a particular amino acid at a particular position, thereby generating a gapped spatial representation of the protein; and
-   determining a pathogenicity of a nucleotide variant based at least in part on
    -   the gapped spatial representation, and
    -   a representation of an alternate amino acid created by the nucleotide variant at the particular position.
-   2. The computer-implemented method of clause 1, wherein the removal of the particular spatial configuration is implemented by a script.
-   3. A computer-implemented method of determining pathogenicity of nucleotide variants, including:
-   removing, from a protein, a particular amino acid at a particular position, thereby generating a gapped protein; and
-   determining a pathogenicity of a nucleotide variant based at least in part on the gapped protein and an alternate amino acid created by the nucleotide variant at the particular position.
-   4. The computer-implemented method of clause 3, wherein the removal of the particular amino acid is implemented by a script.
-   5. A system to predict spatial tolerability of amino acid substitutes, comprising:
-   gapping logic configured to remove, from a protein, a particular amino acid at a particular position, and create an amino acid vacancy at the particular position in the protein; and
-   substitution logic configured to process the protein with the amino acid vacancy, and score tolerability of substitute amino acids that are candidates for filling the amino acid vacancy.
-   6. The system of clause 5, wherein the substitution logic is further configured to score the tolerability of the substitute amino acids based at least in part on structural compatibility between the substitute amino acids and adjacent amino acids in a neighborhood of the amino acid vacancy.
-   7. A computer-implemented method of determining pathogenicity of nucleotide variants, including:
-   accessing a protein that has respective amino acids at respective positions;
-   specifying a particular amino acid of a particular amino acid class at a particular position in the protein as a gap amino acid, and specifying remaining amino acids at remaining positions in the protein as non-gap amino acids;
-   generating a gapped spatial representation of the protein that
    -   includes spatial configurations of the non-gap amino acids, and
    -   excludes a spatial configuration of the gap amino acid; and
-   based at least in part on the gapped spatial representation, determining a pathogenicity of respective alternate amino acids at the particular position,
    -   wherein the respective alternate amino acids have respective amino acid classes that are different from the particular amino acid class.
-   8. A system to predict evolutionary conservation of amino acid substitutes, comprising:
-   gapping logic configured to remove, from a protein, a particular amino acid at a particular position, and create an amino acid vacancy at the particular position in the protein; and
-   substitution logic configured to process the protein with the amino acid vacancy, and score evolutionary conservation of substitute amino acids that are candidates for filling the amino acid vacancy.
-   9. The system of clause 8, wherein the substitution logic is further configured to score the evolutionary conservation of the substitute amino acids based at least in part on structural compatibility between the substitute amino acids and adjacent amino acids in a neighborhood of the amino acid vacancy.
-   10. The system of clause 8, wherein the evolutionary conservation is scored using evolutionary conservation frequencies.
-   11. The system of clause 10, wherein the evolutionary conservation frequencies are based on a position-specific frequency matrix (PSFM).
-   12. The system of clause 10, wherein the evolutionary conservation frequencies are based on a position-specific scoring matrix (PSSM).
-   13. The system of clause 8, wherein evolutionary conservation scores of the substitute amino acids are rank-ordered by magnitude.
-   14. A system to predict evolutionary conservation of amino acid substitutes, comprising:
-   gapping logic configured to remove, from a protein, a particular amino acid at a particular position, and create an amino acid vacancy at the particular position in the protein; and
-   evolutionary conservation prediction logic configured to process the protein with the amino acid vacancy, and rank evolutionary conservation of substitute amino acids that are candidates for filling the amino acid vacancy.
-   15. A system to predict structural tolerability of amino acid substitutes, comprising:
-   gapping logic configured to remove, from a protein, a particular amino acid at a particular position, and create an amino acid vacancy at the particular position in the protein; and
-   structural tolerability prediction logic configured to process the protein with the amino acid vacancy, and rank structural tolerability of substitute amino acids that are candidates for filling the amino acid vacancy based on amino acid co-occurrence patterns in a neighborhood of the amino acid vacancy.
-   16. A computer-implemented method of determining pathogenicity of nucleotide variants, including:
-   accessing a protein that has respective amino acids at respective positions;
-   specifying a particular amino acid at a particular position in the protein as a gap amino acid, and specifying remaining amino acids at remaining positions in the protein as non-gap amino acids;
-   generating a gapped spatial representation of the protein that
    -   includes spatial configurations of the non-gap amino acids, and
    -   excludes a spatial configuration of the gap amino acid;
-   determining an evolutionary conservation of an alternate amino acid at the particular position based at least in part on
    -   the gapped spatial representation, and
    -   a representation of the alternate amino acid; and
-   determining a pathogenicity of a nucleotide variant that creates the alternate amino acid based at least in part on the evolutionary conservation.
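
As a non-limiting illustration of the gapping logic recited in the clauses above, the following sketch removes the spatial configuration of one amino acid from a simplified protein representation and lists the alternate amino acid classes that are candidates for the resulting vacancy. The `Residue` record, the use of a single 3D coordinate per amino acid, and the function names are assumptions made for the example; the scoring of the candidates by a trained model is not shown.

```python
from dataclasses import dataclass
from typing import List, Tuple

AMINO_ACID_CLASSES = "ACDEFGHIKLMNPQRSTVWY"  # twenty one-letter amino acid classes


@dataclass(frozen=True)
class Residue:
    position: int                        # 1-based position in the protein
    amino_acid: str                      # one-letter amino acid class
    coords: Tuple[float, float, float]   # hypothetical single 3D coordinate


def make_gapped(protein: List[Residue], gap_position: int) -> List[Residue]:
    """Return a copy of the representation without the residue at gap_position."""
    if not any(r.position == gap_position for r in protein):
        raise ValueError(f"no residue at position {gap_position}")
    # Non-gap amino acids keep their spatial configurations; the gap amino acid is excluded.
    return [r for r in protein if r.position != gap_position]


def alternate_classes(protein: List[Residue], gap_position: int) -> List[str]:
    """Amino acid classes that differ from the class of the gap amino acid."""
    ref = next(r.amino_acid for r in protein if r.position == gap_position)
    return [aa for aa in AMINO_ACID_CLASSES if aa != ref]


# Toy three-residue protein; the coordinates are made up for the example.
protein = [
    Residue(1, "M", (0.0, 0.0, 0.0)),
    Residue(2, "K", (3.8, 0.0, 0.0)),
    Residue(3, "L", (7.6, 0.0, 0.0)),
]
gapped = make_gapped(protein, gap_position=2)
candidates = alternate_classes(protein, gap_position=2)
```

Returning a new list rather than mutating the input keeps the non-gapped representation available, which is convenient when both gapped and non-gapped representations of the same protein are needed.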

Clauses Set 8

-   1. A computer-implemented method of determining pathogenicity of nucleotide variants, including:
-   accessing a spatial representation of a protein, wherein the spatial representation of the protein specifies respective spatial configurations of respective amino acids at respective positions in the protein;
-   removing, from the spatial representation of the protein, a particular spatial configuration of a particular amino acid at a particular position, thereby generating a gapped spatial representation of the protein; and
-   determining a pathogenicity of a nucleotide variant based at least in part on
    -   the gapped spatial representation, and
    -   a representation of an alternate amino acid created by the nucleotide variant at the particular position.
-   2. The computer-implemented method of clause 1, wherein the removal of the particular spatial configuration is implemented by a script.
-   3. A computer-implemented method of determining pathogenicity of nucleotide variants, including:
-   removing, from a spatial representation of a protein, a particular amino acid at a particular position, thereby generating a gapped spatial representation of the protein; and
-   determining a pathogenicity of a nucleotide variant based at least in part on the gapped spatial representation of the protein and an alternate amino acid created by the nucleotide variant at the particular position.
-   4. The computer-implemented method of clause 3, wherein the removal of the particular amino acid is implemented by a script.
-   5. A system to predict spatial tolerability of amino acid substitutes, comprising:
-   gapping logic configured to remove, from a spatial representation of a protein, a particular amino acid at a particular position, and create an amino acid vacancy at the particular position in the spatial representation of the protein; and
-   substitution logic configured to process the spatial representation of the protein with the amino acid vacancy, and score tolerability of substitute amino acids that are candidates for filling the amino acid vacancy.
-   6. The system of clause 5, wherein the substitution logic is further configured to score the tolerability of the substitute amino acids based at least in part on structural compatibility between the substitute amino acids and adjacent amino acids in a neighborhood of the amino acid vacancy.
-   7. A computer-implemented method of determining pathogenicity of nucleotide variants, including:
-   accessing a protein that has respective amino acids at respective positions;
-   specifying a particular amino acid of a particular amino acid class at a particular position in the protein as a gap amino acid, and specifying remaining amino acids at remaining positions in the protein as non-gap amino acids;
-   generating a gapped spatial representation of the protein that
    -   includes spatial configurations of the non-gap amino acids, and
    -   excludes a spatial configuration of the gap amino acid; and
-   based at least in part on the gapped spatial representation, determining a pathogenicity of respective alternate amino acids at the particular position,
    -   wherein the respective alternate amino acids have respective amino acid classes that are different from the particular amino acid class.
-   8. A system to predict evolutionary conservation of amino acid substitutes, comprising:
-   gapping logic configured to remove, from a spatial representation of a protein, a particular amino acid at a particular position, and create an amino acid vacancy at the particular position in the spatial representation of the protein; and
-   substitution logic configured to process the spatial representation of the protein with the amino acid vacancy, and score evolutionary conservation of substitute amino acids that are candidates for filling the amino acid vacancy.
-   9. The system of clause 8, wherein the substitution logic is further configured to score the evolutionary conservation of the substitute amino acids based at least in part on structural compatibility between the substitute amino acids and adjacent amino acids in a neighborhood of the amino acid vacancy.
-   10. The system of clause 8, wherein the evolutionary conservation is scored using evolutionary conservation frequencies.
-   11. The system of clause 10, wherein the evolutionary conservation frequencies are based on a position-specific frequency matrix (PSFM).
-   12. The system of clause 10, wherein the evolutionary conservation frequencies are based on a position-specific scoring matrix (PSSM).
-   13. The system of clause 8, wherein evolutionary conservation scores of the substitute amino acids are rank-ordered by magnitude.
-   14. A system to predict evolutionary conservation of amino acid substitutes, comprising:
-   gapping logic configured to remove, from a spatial representation of a protein, a particular amino acid at a particular position, and create an amino acid vacancy at the particular position in the spatial representation of the protein; and
-   evolutionary conservation prediction logic configured to process the spatial representation of the protein with the amino acid vacancy, and rank evolutionary conservation of substitute amino acids that are candidates for filling the amino acid vacancy.
-   15. A system to predict structural tolerability of amino acid substitutes, comprising:
-   gapping logic configured to remove, from a spatial representation of a protein, a particular amino acid at a particular position, and create an amino acid vacancy at the particular position in the spatial representation of the protein; and
-   structural tolerability prediction logic configured to process the spatial representation of the protein with the amino acid vacancy, and rank structural tolerability of substitute amino acids that are candidates for filling the amino acid vacancy based on amino acid co-occurrence patterns in a neighborhood of the amino acid vacancy.
-   16. A computer-implemented method of determining pathogenicity of nucleotide variants, including:
-   accessing a protein that has respective amino acids at respective positions;
-   specifying a particular amino acid at a particular position in the protein as a gap amino acid, and specifying remaining amino acids at remaining positions in the protein as non-gap amino acids;
-   generating a gapped spatial representation of the protein that
    -   includes spatial configurations of the non-gap amino acids, and
    -   excludes a spatial configuration of the gap amino acid;
-   determining an evolutionary conservation of an alternate amino acid at the particular position based at least in part on
    -   the gapped spatial representation, and
    -   a representation of the alternate amino acid; and
-   determining a pathogenicity of a nucleotide variant that creates the alternate amino acid based at least in part on the evolutionary conservation.
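
For illustration, evolutionary conservation frequencies of the kind referenced in clauses 10 to 13 above can be pictured as one column of a position-specific frequency matrix (PSFM), rank-ordered by magnitude. The sketch below computes such a column from a toy alignment column; the alignment data, the treatment of the gap character, and the function names are assumptions for the example, and in the disclosed systems the scores would come from the prediction logic rather than from direct counting.

```python
from collections import Counter

AMINO_ACID_CLASSES = "ACDEFGHIKLMNPQRSTVWY"


def psfm_column(alignment_column, classes=AMINO_ACID_CLASSES):
    """Per-class frequencies at one position of a multiple sequence alignment."""
    # Assumption: characters outside the twenty classes (e.g. '-') are ignored.
    counts = Counter(aa for aa in alignment_column if aa in classes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("alignment column contains no amino acids")
    return {aa: counts[aa] / total for aa in classes}


def rank_by_magnitude(frequencies):
    """Rank-order classes by descending frequency; ties are broken alphabetically."""
    return sorted(frequencies.items(), key=lambda item: (-item[1], item[0]))


# Toy column: the residue observed at one position across six aligned sequences.
column = "LLLIV-"
frequencies = psfm_column(column)
ranking = rank_by_magnitude(frequencies)
```

Here leucine ranks first, followed by isoleucine and valine, while the seventeen unobserved classes share a frequency of zero.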

While the present invention is disclosed by reference to the preferred implementations and examples detailed above, it is to be understood that these examples are intended in an illustrative rather than in a limiting sense. It is contemplated that modifications and combinations will readily occur to those skilled in the art, which modifications and combinations will be within the spirit of the invention and the scope of the following claims.

What is claimed is:
 1. A computer-implemented method of training a pathogenicity predictor, including: accessing a gapped training set that includes respective gapped protein samples for respective positions in a proteome; accessing a non-gapped training set that includes non-gapped benign protein samples and non-gapped pathogenic protein samples; generating respective gapped spatial representations for the gapped protein samples, and generating respective non-gapped spatial representations for the non-gapped benign protein samples and the non-gapped pathogenic protein samples; training a pathogenicity predictor over one or more training cycles and generating a trained pathogenicity predictor, wherein each of the training cycles uses as training examples gapped spatial representations from the respective gapped spatial representations and non-gapped spatial representations from the respective non-gapped spatial representations; and using the trained pathogenicity classifier to determine pathogenicity of variants.
 2. The computer-implemented method of claim 1, wherein the respective gapped protein samples are labelled with respective gapped ground truth sequences.
 3. The computer-implemented method of claim 2, wherein a particular gapped ground truth sequence for a particular gapped protein sample has a benign label for a particular amino acid class that corresponds to a reference amino acid at a particular position in the particular gapped protein.
 4. The computer-implemented method of claim 3, wherein the particular gapped protein sample has respective pathogenic labels for respective remaining amino acid classes that correspond to alternate amino acids at the particular position.
 5. The computer-implemented method of claim 1, wherein a particular non-gapped benign protein sample includes a benign alternate amino acid at a particular position substituted by a benign nucleotide variant.
 6. The computer-implemented method of claim 5, wherein a particular non-gapped pathogenic protein sample includes a pathogenic alternate amino acid at a particular position substituted by a pathogenic nucleotide variant.
 7. The computer-implemented method of claim 6, wherein the particular non-gapped benign protein sample is labelled with a benign ground truth sequence that has a benign label for a particular amino acid class that corresponds to the benign alternate amino acid.
 8. The computer-implemented method of claim 7, wherein the benign ground truth sequence has respective masked labels for respective remaining amino acid classes that correspond to amino acids that are different from the benign alternate amino acid.
 9. The computer-implemented method of claim 8, wherein the particular non-gapped pathogenic protein sample is labelled with a pathogenic ground truth sequence that has a pathogenic label for a particular amino acid class that corresponds to the pathogenic alternate amino acid.
 10. The computer-implemented method of claim 9, wherein the pathogenic ground truth sequence has respective masked labels for respective remaining amino acid classes that correspond to amino acids that are different from the pathogenic alternate amino acid.
 11. The computer-implemented method of claim 1, further including using a sample indicator to indicate to the pathogenicity predictor whether a current training example is a gapped spatial representation for a gapped protein sample, or a non-gapped spatial representation for a non-gapped protein sample.
 12. The computer-implemented method of claim 3, further including masking the benign label for the particular amino acid class that corresponds to the reference amino acid at the particular position in the particular gapped protein.
 13. The computer-implemented method of claim 1, wherein the non-gapped benign protein samples are derived from common human and non-human primate nucleotide variants.
 14. The computer-implemented method of claim 1, wherein the non-gapped pathogenic protein samples are derived from combinatorically simulated nucleotide variants.
 15. The computer-implemented method of claim 1, wherein the pathogenicity predictor generates an amino acid class-wise output sequence in response to processing a training example, wherein the amino acid class-wise output sequence has amino acid class-wise pathogenicity scores.
 16. The computer-implemented method of claim 1, further including measuring performance of the trained pathogenicity predictor between training cycles over a validation set.
 17. The computer-implemented method of claim 16, wherein the validation set includes a pair of gapped and non-gapped spatial representations for each held-out protein sample.
 18. The computer-implemented method of claim 17, wherein the trained pathogenicity predictor generates a first amino acid class-wise output sequence for the gapped spatial representation in the pair, and a second amino acid class-wise output sequence for the non-gapped spatial representation in the pair, wherein a final pathogenicity score for a nucleotide variant that causes an amino acid substitution in a held-out protein sample is determined based on a combination of first and second pathogenicity scores for the amino acid substitution in the first and second amino acid class-wise output sequences.
 19. The computer-implemented method of claim 18, wherein the final pathogenicity score is based on an average of the first and second pathogenicity scores.
 20. The computer-implemented method of claim 1, wherein at least some of the training cycles use a same number of gapped spatial representations and non-gapped spatial representations.
 21. The computer-implemented method of claim 1, wherein at least some of the training cycles use batches of training examples that have a same number of gapped spatial representations and non-gapped spatial representations.
 22. The computer-implemented method of claim 1, wherein a masked label does not contribute to error determination, and therefore does not contribute to training of the pathogenicity predictor.
 23. The computer-implemented method of claim 22, wherein the masked label is zeroed-out.
 24. The computer-implemented method of claim 1, wherein the gapped spatial representations are weighted differently from the non-gapped spatial representations, such that a contribution of the gapped spatial representations to gradient updates applied to parameters of the pathogenicity predictor in response to the pathogenicity predictor processing the gapped spatial representations varies from a contribution of the non-gapped spatial representations to gradient updates applied to the parameters of the pathogenicity predictor in response to the pathogenicity predictor processing the non-gapped spatial representations.
 25. The computer-implemented method of claim 24, wherein the variation is determined by pre-defined weights.
 26. A computer-implemented method of training a pathogenicity predictor, including: starting with training a pathogenicity classifier on a gapped training set and generating a trained pathogenicity classifier; further training the trained pathogenicity classifier on a non-gapped training set and generating a retrained pathogenicity classifier; and using the retrained pathogenicity classifier to determine pathogenicity of variants.
 27. The computer-implemented method of claim 26, further including measuring performance of the trained pathogenicity predictor between training cycles over a first validation set that includes only non-gapped spatial representations of held-out protein samples.
 28. The computer-implemented method of claim 27, further including measuring performance of the retrained pathogenicity predictor between training cycles over a second validation set that includes gapped spatial representations and non-gapped spatial representations of held-out protein samples.
 29. The computer-implemented method of claim 28, wherein the retrained pathogenicity predictor generates a first amino acid class-wise output sequence for the pair in response to processing the pair, wherein a final pathogenicity score for a nucleotide variant that causes an amino acid substitution in a corresponding held-out protein sample is determined based on the first amino acid class-wise output sequence.
 30. A computer-implemented method of training a pathogenicity predictor, including: accessing a gapped training set that includes respective gapped protein samples for respective positions in a proteome, wherein the respective gapped protein samples are labelled with respective gapped ground truth sequences, wherein a particular gapped ground truth sequence for a particular gapped protein sample has a benign label for a particular amino acid class that corresponds to a reference amino acid at a particular position in the particular gapped protein, and has respective pathogenic labels for respective remaining amino acid classes that correspond to alternate amino acids at the particular position; accessing a non-gapped training set that includes non-gapped benign protein samples and non-gapped pathogenic protein samples, wherein a particular non-gapped benign protein sample includes a benign alternate amino acid at a particular position substituted by a benign nucleotide variant, wherein a particular non-gapped pathogenic protein sample includes a pathogenic alternate amino acid at a particular position substituted by a pathogenic nucleotide variant, wherein the particular non-gapped benign protein sample is labelled with a benign ground truth sequence that has a benign label for a particular amino acid class that corresponds to the benign alternate amino acid, and respective masked labels for respective remaining amino acid classes that correspond to amino acids that are different from the benign alternate amino acid, and wherein the particular non-gapped pathogenic protein sample is labelled with a pathogenic ground truth sequence that has a pathogenic label for a particular amino acid class that corresponds to the pathogenic alternate amino acid, and respective masked labels for respective remaining amino acid classes that correspond to amino acids that are different from the pathogenic alternate amino acid; generating respective gapped spatial representations for the gapped protein samples, and generating respective non-gapped spatial representations for the non-gapped benign protein samples and the non-gapped pathogenic protein samples; training a pathogenicity predictor over one or more training cycles, and generating a trained pathogenicity predictor, wherein each of the training cycles uses as training examples gapped spatial representations from the respective gapped spatial representations, and non-gapped spatial representations from the respective non-gapped spatial representations; and using the trained pathogenicity classifier to determine pathogenicity of variants.
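
The labelling scheme recited in the claims above can be pictured with a minimal, non-limiting sketch. It assumes a numeric encoding of 0.0 for benign and 1.0 for pathogenic, represents a masked label as `None`, and uses a mean squared error as a stand-in loss; none of these choices, nor the function names, are specified by the claims.

```python
CLASSES = "ACDEFGHIKLMNPQRSTVWY"             # twenty amino acid classes
BENIGN, PATHOGENIC, MASKED = 0.0, 1.0, None  # assumed label encoding


def gapped_labels(ref_aa):
    """Gapped sample: benign for the reference class, pathogenic for the other classes."""
    return [BENIGN if aa == ref_aa else PATHOGENIC for aa in CLASSES]


def non_gapped_labels(alt_aa, is_pathogenic):
    """Non-gapped sample: one label for the alternate class, masked labels elsewhere."""
    label = PATHOGENIC if is_pathogenic else BENIGN
    return [label if aa == alt_aa else MASKED for aa in CLASSES]


def masked_loss(scores, labels):
    """Mean squared error over unmasked classes; masked classes contribute nothing."""
    terms = [(s - y) ** 2 for s, y in zip(scores, labels) if y is not MASKED]
    return sum(terms) / len(terms) if terms else 0.0


def final_score(gapped_scores, non_gapped_scores, alt_aa):
    """Average of the gapped and non-gapped class-wise scores for one substitution."""
    i = CLASSES.index(alt_aa)
    return (gapped_scores[i] + non_gapped_scores[i]) / 2.0


# A pathogenic non-gapped sample whose alternate amino acid is leucine.
labels = non_gapped_labels("L", is_pathogenic=True)
loss = masked_loss([0.5] * 20, labels)                # only the 'L' class is scored
score = final_score([0.2] * 20, [0.4] * 20, "L")      # combined validation-time score
```

Because only unmasked classes enter the loss, a non-gapped training example supervises a single class of the class-wise output sequence, whereas a gapped training example supervises all twenty.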